arXiv:math/0503485v3 [math.PR] 5 Jul 2006 


The Annals of Applied Probability 
2006, Vol. 16, No. 2, 685-729 
DOI: 10.1214/105051606000000114 
© Institute of Mathematical Statistics, 2006 


AN APPROXIMATE SAMPLING FORMULA UNDER 
GENETIC HITCHHIKING 

By Alison Etheridge, Peter Pfaffelhuber and
Anton Wakolbinger

University of Oxford, Ludwig-Maximilian University Munich 
and Goethe-University Frankfurt 

For a genetic locus carrying a strongly beneficial allele which has 
just fixed in a large population, we study the ancestry at a linked 
neutral locus. During this “selective sweep” the linkage between the 
two loci is broken up by recombination and the ancestry at the neutral 
locus is modeled by a structured coalescent in a random background. 

For large selection coefficients $\alpha$ and under an appropriate scaling of
the recombination rate, we derive a sampling formula with an order
of accuracy of $\mathcal{O}((\log\alpha)^{-2})$ in probability. In particular we see that,
with this order of accuracy, in a sample of fixed size there are at most 
two nonsingleton families of individuals which are identical by descent 
at the neutral locus from the beginning of the sweep. This refines a 
formula going back to the work of Maynard Smith and Haigh, and 
complements recent work of Schweinsberg and Durrett on selective 
sweeps in the Moran model. 

1. Introduction. Assume that part of a large population of size 2N
carries, on some fixed genetic locus (henceforth referred to as the selective
locus), an allele with a certain selective advantage. If the population
reproduction is described by a classical Fisher-Wright model or, more generally,
a Cannings model with individual offspring variance $\sigma^2$ per generation, and
time is measured in units of 2N generations, the evolution of the fraction
carrying the advantageous allele is approximately described by the
Fisher-Wright stochastic differential equation (SDE)

(1.1) $dP = \sqrt{\sigma^2 P(1-P)}\,dW + \alpha P(1-P)\,dt,$

where W is a standard Wiener process and $s = \alpha/2N$ is the selective
advantage of the gene per individual per generation [6, 8].

Assume at a certain time a sample of size n is drawn from the
subpopulation carrying the advantageous allele. Conditioned on the path P, the
ancestral tree of the sample at the selective locus is described by Kingman's
coalescent with pair coalescence rate $\sigma^2/P$ (see, e.g., [14] and Remark 4.6
below).

Now consider a neutral locus in the neighborhood of the selective one,
with a recombination probability r per individual per generation between
the two loci. From generation to generation, there is a small probability r
per individual that the gene at the neutral locus is broken apart from its
selective partner and recombined with another one, randomly chosen from the
population. In the diffusion limit considered here, this translates into the
recombination rate $\rho$. Depending on P, only a fraction of these recombination
events will be effective in changing the status of the selective partner from
"advantageous" to "nonadvantageous" or vice versa. Given P, the genealogy
of the sample at the neutral locus can thus be modeled by a structured
coalescent of the neutral lineages in background P as in [2]: Backward in time, a
neutral lineage currently linked to the advantageous allele recombines with
a nonadvantageous one at rate $\rho(1-P)$, and a neutral lineage currently
linked to a nonadvantageous gene recombines with the advantageous one
at rate $\rho P$, where $\rho = 2Nr$. Moreover, two neutral lineages currently both
linked to the advantageous allele coalesce at rate $\sigma^2/P$, and two neutral
lineages currently both linked to a nonadvantageous allele coalesce at rate
$\sigma^2/(1-P)$.

Two individuals sampled at time t > 0 are said to be identical by descent 
at the neutral locus from time 0 if their neutral ancestral lineages coalesce 
between times 0 and t. This defines an ancestral sample partition at the 
neutral locus at time t from time 0. 

We are interested in a situation in which at a certain time (say time 0) a
single copy of the advantageous gene enters into the population and
eventually fixates. For large N, the time evolution of the size of the subpopulation
carrying the advantageous allele can be thought of as a random path X
governed by an h-transform of (1.1), entering from x = 0 and conditioned to
hit x = 1.

The parameters $\alpha$, $\rho$ and $\sigma^2$, the random path X, its fixation time T and
the structured n-coalescent in background X from time T back to time 0
are the principal ingredients in the first part of our analysis. Our central
object of interest is the ancestral sample partition at the neutral locus at
time T from time 0.

For simplicity we put $\sigma^2 = 2$. This not only simplifies some formulae, but
also allows a direct comparison with the results of Schweinsberg and Durrett
[21], who considered the finite population analog in which the population
evolves according to a Moran model, leading to $\sigma^2 = 2$ in the diffusion
approximation.

We focus on large coefficients $\alpha$ and refer to X as the random path in
a selective sweep. The expected duration of the sweep is approximately
$(2\log\alpha)/\alpha$ (see Lemma 3.1). Heuristically, the sweep can be divided into
several phases. Phases that must certainly be considered are the time
intervals which X takes:

• to reach a small level $\varepsilon$ (phase 1);

• to climb from $\varepsilon$ to $1-\varepsilon$ (phase 2);

• to fixate in 1 after exceeding the level $1-\varepsilon$ (phase 3).

Whereas the expected durations of phases 1 and 3 are both of order
$\log\alpha/\alpha$, that of phase 2 is only of order $1/\alpha$. The analysis of hitchhiking
has in the past often concentrated on the second phase [15, 22]. For large
population size and large selective advantage, the frequency path of the
beneficial allele in phase 2 can be approximately described by a deterministic
logistic growth curve (see, e.g., [17]). However, this approximation is only
good for fixed $\varepsilon > 0$. In [15] the frequency path under a selective sweep is
described by a logistic growth curve that models phase 2 with
$\varepsilon = 5/\alpha$ (recall that $\alpha$ stands for 2Ns), whereas in [22] $\varepsilon = 1/(2N)$ is
considered. In both cases, $\varepsilon$ decreases with population size. As Barton
pointed out in [1], the logistic model fails to include the randomness of the
frequency path at the very beginning of the selective sweep. Consequently he
further subdivided phase 1 so as to study the onset of the sweep in more
detail.
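
The role of the level $\varepsilon$ in the logistic description can be made concrete with a small sketch. It assumes that the deterministic part of (1.1), that is $dx/dt = \alpha x(1-x)$, is taken as the logistic curve; the function names and the parameter values used below are illustrative only and do not come from the cited works.

```python
import math


def logistic_path(t, alpha, x0):
    """Solution at time t of dx/dt = alpha * x * (1 - x) started at x(0) = x0."""
    odds0 = x0 / (1.0 - x0)               # the odds x/(1-x) grow like exp(alpha*t)
    odds_t = odds0 * math.exp(alpha * t)
    return odds_t / (1.0 + odds_t)


def phase2_duration(alpha, eps):
    """Time the logistic curve needs to climb from eps to 1 - eps."""
    return 2.0 * math.log((1.0 - eps) / eps) / alpha


if __name__ == "__main__":
    alpha = 2000.0                         # hypothetical value of the selection coefficient
    for eps in (0.01, 5.0 / alpha):
        print(eps, phase2_duration(alpha, eps), 2.0 * math.log(alpha) / alpha)
```

For fixed $\varepsilon$ the climbing time is a constant multiple of $1/\alpha$, whereas for $\varepsilon = 5/\alpha$ it equals $2\log((\alpha-5)/5)/\alpha$, which is of the same order as the total duration $(2\log\alpha)/\alpha$.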

No matter how the phases of a sweep are chosen, we have seen that the
first phase as given above takes a time of order $\log\alpha/\alpha$. So, to see a
nontrivial number of recombination events along a single lineage between
t = 0 and t = T, the recombination rate $\rho$ should be on the order of
$\alpha/\log\alpha$. Henceforth, we therefore assume

(1.2) $\rho = \gamma\,\frac{\alpha}{\log\alpha}, \qquad 0 < \gamma < \infty.$

With this recombination rate, it will turn out that, asymptotically as
$\alpha \to \infty$, effectively no recombinations happen in phase 2, since this phase
is so short; neither do they occur in phase 3, since then 1 - X is so small.
Consequently, the probability that a single ancestral lineage is not hit by a
recombination is approximately given by

(1.3) $p = e^{-\gamma}.$



Since for large $\alpha$ the subpopulation carrying the advantageous allele is
expanding quickly near time t = 0, a first approximation to the sample
genealogy at the selective locus is given by a star-shaped tree, that is, n
lineages all coalescing at t = 0. Hence, ignoring possible back-recombinations
(which can be shown to have small probability; see Proposition 3.4), a first
approximation to the sampling distribution at the neutral locus is given by
a Binomial(n, p) number of individuals stemming from the founder of the
sweep; the rest of the individuals are recombinants that all have different
neutral ancestors at the beginning of the sweep (see Remark 2.6 and cf.
[21], Theorem 1.1). This approximate sampling formula goes back to the
pioneering work of Maynard Smith and Haigh [18], who also coined the term
hitchhiking: the allele which the founder of the sweep carried at the neutral
locus gets a lift into a substantial part of the population (and the sample)
in the course of the sweep. Apart from the hitchhikers, there are also a few
free riders who jump on the sweep when it is already under way and make
it as singletons into the sample.
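
The first-order picture just described can be sketched in a few lines. The function name is hypothetical; the only inputs are the sample size n and the constant $\gamma$ of (1.2), and the success probability is $p = e^{-\gamma}$ as in (1.3).

```python
import math


def first_order_family_pmf(n, gamma):
    """pmf of the number K of sampled individuals that stem from the founder,
    K ~ Binomial(n, p) with p = exp(-gamma); the other n - K sampled
    individuals are recombinant singletons in this approximation."""
    p = math.exp(-gamma)
    return [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


if __name__ == "__main__":
    for k, weight in enumerate(first_order_family_pmf(5, 0.5)):
        print(k, round(weight, 4))
```

In this approximation the sample partition consists of one hitchhiking family of size K and n - K singletons; no second nonsingleton family can occur.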

We will see that the sampling formula just described is accurate with
probability $1 - \mathcal{O}(\frac{1}{\log\alpha})$. In a spirit similar to [21], but with a somewhat
different strategy, we will improve the order of accuracy up to
$\mathcal{O}(\frac{1}{(\log\alpha)^2})$. A common
technical theme is an approximation of the genealogy of the advantageous
allele in the early phase of the sweep by a supercritical branching process
[1, 21], an idea which can be traced back to Fisher [9] (see [8], page 27ff). It
is this early phase of the sweep which is relevant for the deviations from the
Maynard Smith and Haigh sampling distribution. This higher-order
approximation allows for the possibility of the occurrence of recombination events
during this early phase that affect our sample and thus, potentially, lead to
nonsingleton families associated to a neutral type other than the original
hitchhiker.

Our main result is the derivation of a sampling formula for the
ancestral partition at the end of the sweep which is accurate up to an error
$\mathcal{O}(1/(\log\alpha)^2)$ in probability.

2. Model and main result. 

2.1. The model. 

Definition 2.1 (Random path of a selective sweep). For $\alpha > 0$, let
$X = (X_t)_{0 \le t \le T}$ be a random path in [0,1] following the SDE

(2.1) $dX = \sqrt{2X(1-X)}\,dW + \alpha X(1-X)\coth\big(\tfrac{\alpha}{2}X\big)\,dt, \qquad 0 < t,$

and entering from 0 at time t = 0. Here, W is a standard Wiener process
and T denotes the time when X hits 1 for the first time.

Note that 0 is an entrance boundary for (2.1) and that X given by (2.1)
arises as an h-transform of the solution of (1.1). Indeed, since the latter has
generator

$Gf(x) = x(1-x)f''(x) + \alpha x(1-x)f'(x)$

and the G-harmonic function $h : [0,1] \to [0,1]$ with boundary conditions
h(0) = 0, h(1) = 1 is

$h(x) = \frac{1 - e^{-\alpha x}}{1 - e^{-\alpha}},$

the h-transformed generator is

$G^h f(x) = \frac{1}{h(x)}\,G(hf)(x) = x(1-x)f''(x) + \alpha x(1-x)\coth\big(\tfrac{\alpha}{2}x\big)f'(x);$

see also [11], page 245. As described in the Introduction, X models the
evolution of the size of the subpopulation that consists of all those individuals
which carry an advantageous allele (called B) on a selective locus. At time
t = 0, a single mutant that carries the (then novel) allele B enters into the
population, corresponding to $X_0 = 0$, and X is conditioned to eventually hit
1, which happens at the random time T. We will refer to X given by (2.1)
as the random path of the sweep (or random sweep path for short).
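
To get a feeling for the random sweep path, one can discretize (2.1). The following is a crude Euler-Maruyama sketch and not part of the analysis in this paper: the step size, the clipping of the path to [0,1], the seed and the function names are all assumptions made for illustration. The drift $\alpha x(1-x)\coth(\alpha x/2)$ tends to 2 as x tends to 0, which is what pushes the path away from the entrance boundary.

```python
import math
import random


def sweep_drift(x, alpha):
    """Drift alpha*x*(1-x)*coth(alpha*x/2) of (2.1); its limit as x -> 0 is 2."""
    if x < 1e-12:
        return 2.0 * (1.0 - x)            # use the limit to avoid 0/0 at the boundary
    return alpha * x * (1.0 - x) / math.tanh(0.5 * alpha * x)


def simulate_sweep(alpha, dt=1e-4, seed=0, max_steps=10**6):
    """Euler-Maruyama path started at 0, clipped to [0, 1], stopped on reaching 1.

    Returns the list of path values and the (discretized) fixation time.
    """
    rng = random.Random(seed)
    x = 0.0
    path = [x]
    for _ in range(max_steps):
        noise = math.sqrt(2.0 * x * (1.0 - x) * dt) * rng.gauss(0.0, 1.0)
        x = min(1.0, max(0.0, x + sweep_drift(x, alpha) * dt + noise))
        path.append(x)
        if x >= 1.0:
            return path, dt * (len(path) - 1)
    raise RuntimeError("path did not reach 1 within max_steps")


if __name__ == "__main__":
    alpha = 50.0                           # hypothetical selection coefficient
    path, fixation_time = simulate_sweep(alpha)
    print(fixation_time, 2.0 * math.log(alpha) / alpha)
```

The printed pair compares one simulated fixation time with the leading-order expected duration $(2\log\alpha)/\alpha$; a single path can of course deviate noticeably from this value.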

Definition 2.2 (n-coalescent in background X). Let $n \ge 2$. Given a
random sweep path X, we construct a tree $\mathcal{T}_n = \mathcal{T}_n^X$ with n leaves as follows.
Attach the n leaves of $\mathcal{T}_n$ at time t = T and work backward in time t from
t = T to t = 0 (or equivalently, forward in time $\beta = T - t$ from $\beta = 0$ to
$\beta = T$) with pair coalescence rate $2/X_t$ [i.e., starting from the tree top
($\beta = 0$), the current number k of lineages in $\mathcal{T}_n$ decreases at rate
$\frac{2}{X_t}\binom{k}{2}$]. Finally, attach the root of $\mathcal{T}_n$ at time t = 0.

Note that $\mathcal{T}_n$ corresponds to a time-changed Kingman coalescent (see
[12]). This time change transforms the one single lineage of infinite length,
which can be thought to follow the ultimate coalescence in Kingman's
coalescent, into a single lineage ending at $\beta = T$. We will refer to the tree $\mathcal{T}_n$
as the n-coalescent in background X; it describes the genealogy at the
selective locus of a sample of size n taken from the population at the instant of
completion of the sweep.

Definition 2.3 (Coalescing Markov chains in background X). Let
$\rho > 0$. Given a random sweep path X, let $(\xi_\beta)_{0 \le \beta \le T}$ be a {B, b}-valued
Markov chain with time inhomogeneous jump rates

(2.2)   $(1 - X_{T-\beta})\rho = (1 - X_t)\rho$   from B to b,
        $X_{T-\beta}\,\rho = X_t\,\rho$   from b to B.

The process $\xi$ describes to which type at the selective locus (either B or
b) an ancestral lineage of the neutral locus is currently linked in its journey
back into the past, indexed by the backward time $\beta$. Recall that each neutral
gene (i.e., each gene at the neutral locus) is linked at any time to a selective
gene (i.e., at the selective locus), the latter being either of type B or of type
b, and that for the neutral lineages which are currently in state B, only a
fraction $1 - X_t$ of the recombination events is effective in taking them into
state b.


Definition 2.4 (Structured n-coalescent in background X; see [2]). Given
X, consider n independent copies of the Markov chain $\xi$ (see Definition 2.3),
all of them starting in state B at time $\beta = 0$. Let any two $\xi$-walkers who are
currently (say at time $\beta = T - t$) in state B coalesce at rate
$2/X_{T-\beta} = 2/X_t$ and let any two $\xi$-walkers in state b coalesce at rate
$2/(1 - X_t)$. The resulting (exchangeable) system $\Xi_n$ of coalescing Markov
chains will be called the structured n-coalescent in background X.

We now define a labeled partition $\mathcal{P}^{\Xi_n}$ of {1,...,n} induced by the
structured coalescent $\Xi_n$.

Definition 2.5 (Ancestral sample partition at the neutral locus). In
the situation of Definition 2.4, we will say that i and j, $1 \le i < j \le n$, belong
to the same family if the $\xi$-chains numbered i and j in $\Xi_n$ coalesce before
time $\beta = T$.

1. A family will be labeled nonrecombinant if none of the ancestral lineages
of the family back to time $\beta = T$ ever left state B. (Thus the neutral
ancestor of a nonrecombinant family at time t = 0 is necessarily linked to
the single selective founder of the sweep.)

2. A family will be labeled early recombinant if none of its ancestral lineages
ever left state B before the first (looking backward from t = T)
coalescence in the sample genealogy happened, but if nonetheless the family's
ancestor at time t = 0 is in state b.

3. A family will be labeled late recombinant if at least one of its ancestral
lineages left state B before the first (looking backward from T)
coalescence in the sample genealogy happened and if the family's ancestor at
time t = 0 is in state b.

4. In all other cases (e.g., if two lineages on their way back first leave B,
then coalesce and return to B afterward), the family will be labeled
exceptional.

The labeled partition resulting in this way will be called $\mathcal{P}^{\Xi_n}$.


For large selection coefficients and moderately large recombination rates
[see (1.2)] it turns out that, up to an error in probability of
$\mathcal{O}((\log\alpha)^{-2})$, all late recombinant families are singletons, there is no more
than one early recombinant family and there are no exceptional families. In
fact, the probability that there is an early recombinant family at all is of
the order $(\log\alpha)^{-1}$. Given there is an early recombinant family, however,
its size may well be a substantial fraction of n. Our main result (Theorem 1)
clarifies the approximate distribution of the number of late recombinants and
of the size of the early recombinant family.

2.2. Main result. Recall that $\mathcal{P}^{\Xi_n}$ (introduced in Definition 2.5)
describes the ancestral partition of an n-sample drawn from the population
at the time of completion of the sweep, where the partition is induced by
identity by descent at the neutral locus at the beginning of the sweep.


Theorem 1 (Approximate distribution of the ancestral sample partition).
Fix a sample size n. For a selection coefficient $\alpha \gg 1$ and a recombination
rate $\rho$ obeying (1.2) for fixed $\gamma$, the random partition $\mathcal{P}^{\Xi_n}$ introduced in
Definition 2.5 consists, with probability $1 - \mathcal{O}((\log\alpha)^{-2})$, of the following
parts:

• L late recombinant singletons.

• One family of early recombinants of size E.

• One nonrecombinant family of size n - L - E.

More precisely, a random labeled partition of {1,...,n}, whose distribution
approximates that of $\mathcal{P}^{\Xi_n}$ up to a variation distance of order
$\mathcal{O}((\log\alpha)^{-2})$, is given by random numbers L and E constructed as follows:

Let F be an $\mathbb{N}$-valued random variable with

(2.3) $P[F \le i] = \frac{(i-(n-1)) \cdots (i-1)}{(i+(n-1)) \cdots (i+1)},$

and, given F = f, let L be a binomial random variable with n trials and
success probability $1 - p_f$, where



(2.4)

Fig. 1. [Sample genealogy during the sweep, divided into an early phase and a
late phase; legend: late recombinants, early recombinants, non-recombinants.]


Independently of all this, let S be a {0,1,...,n}-valued random variable with

(2.5) $P[S = s] = \begin{cases}
        \dfrac{\gamma n}{\log\alpha} \sum_{i=2}^{n-1} \dfrac{1}{i}, & s = 1, \\
        \dfrac{\gamma n}{\log\alpha}\,\dfrac{1}{s(s-1)}, & 2 \le s \le n-1, \\
        \dfrac{\gamma n}{\log\alpha}\,\dfrac{1}{n-1}, & s = n.
      \end{cases}$

Given S = s and L = l, the random variable E is hypergeometric, choosing
n - l out of n = s + (n - s), that is,

(2.6) $P[E = e] = \dfrac{\binom{s}{e}\binom{n-s}{n-l-e}}{\binom{n}{n-l}}.$


Sections 3 and 4 will be devoted to the proof of Theorem 1. Figure 1 
explains the concepts which appear in our theorem and points to the strategy 
of the proof as explained in Section 3.2. In the figure, the sample size is 
n = 7, and the x’s indicate effective recombination events that occur along 
the lineages. The early phase ends when the number of lines in the sample 
tree has increased from six to seven. At the end of the early phase there is 
one family of size S = 3 of early recombinants. One member of this family is
then kicked out by a late recombination. In the sample there are L = 2 late 
recombinant singletons, one early recombinant family of size E = 2 and one 
nonrecombinant family of size 3. 
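
The ingredients (2.3), (2.5) and (2.6) of Theorem 1 are explicit enough to be tabulated. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the case s = 1 of (2.5) is read as $\frac{\gamma n}{\log\alpha}\sum_{i=2}^{n-1} 1/i$, and P[S = 0] is taken to be the remaining mass (which is only a probability when $\gamma n/\log\alpha$ is small enough). The probability $p_f$ of (2.4) is deliberately not implemented here.

```python
from fractions import Fraction
from math import comb, log


def cdf_F(i, n):
    """P[F <= i] from (2.3), i.e. the product of (i - j)/(i + j) over j = 1..n-1."""
    if i < n:
        return Fraction(0)                 # one factor (i - j) vanishes for 1 <= i <= n-1
    result = Fraction(1)
    for j in range(1, n):
        result *= Fraction(i - j, i + j)
    return result


def pmf_S(n, gamma, alpha):
    """List [P[S = 0], ..., P[S = n]] following (2.5); P[S = 0] is the remaining mass."""
    c = gamma * n / log(alpha)
    pmf = [0.0] * (n + 1)
    if n >= 2:
        pmf[1] = c * sum(1.0 / i for i in range(2, n))
        for s in range(2, n):
            pmf[s] = c / (s * (s - 1))
        pmf[n] = c / (n - 1)
    pmf[0] = 1.0 - sum(pmf[1:])
    return pmf


def pmf_E(e, s, l, n):
    """Hypergeometric probability (2.6) of E = e given S = s and L = l."""
    if e < 0 or e > s or n - l - e < 0 or n - l - e > n - s:
        return Fraction(0)
    return Fraction(comb(s, e) * comb(n - s, n - l - e), comb(n, n - l))


if __name__ == "__main__":
    print(cdf_F(7, 7), pmf_S(7, 1.0, 1e6), pmf_E(2, 3, 2, 7))
```

For the configuration of Figure 1 (n = 7, S = 3, L = 2), `pmf_E(2, 3, 2, 7)` evaluates to 4/7, the conditional probability that exactly two members of the early recombinant family survive the late recombinations.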

Remark 2.6. (a) From (2.5) we see that

$P[S \ge 2] = \dfrac{\gamma n}{\log\alpha}$

and

$P[S > 0] = \dfrac{\gamma n}{\log\alpha} \sum_{i=1}^{n-1} \dfrac{1}{i}.$

In particular, this shows that the probability that there are early
recombinants at all is $\mathcal{O}(\frac{1}{\log\alpha})$.

(b) Barton [1] reported simulations in which several nonsingleton
recombinant families arise. This is not, in fact, incompatible with our theorem
since the constant in the $\mathcal{O}$ in the error estimate of Theorem 1 depends
on $(\gamma n)^2$. [See, e.g., Section 3.4, where for each pair in our n-sample we
encounter an error $C\gamma^2/(\log\alpha)^2$.] In, for example, the simulation described on
page 130 of [1], in which eight early recombinant families are seen, this
factor $\gamma n$ is $\approx 120$, while $\log\alpha \approx 13$, which explains the occurrence of several
nonsingleton recombinant families.


As a corollary to Theorem 1 we obtain an approximate sampling formula 
under the model for genetic hitchhiking. This means that we can now derive 
the probability of having l late recombinants (which produce singletons), e 
early recombinants, which form a family of size e, and n — l — e lineages that 
go back to the founder of the sweep and also form a family on their own. 


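Corollary 2.7 (Approximate sampling formula). Under the
assumptions of Theorem 1 the common distribution of the number of early
recombinants E and the number of late recombinants L is, with probability
$1 - \mathcal{O}((\log\alpha)^{-2})$, given by

(2.7) $P[E = e, L = l] = E\big[p_F^{\,n-l}(1-p_F)^{l}\big] \cdot \begin{cases}
        \dfrac{n\gamma}{\log\alpha}\Big( \dfrac{1}{n-1}\dbinom{n}{e} 1\{l+e=n\}
          + \sum_{s=e}^{n-1} \dfrac{1}{s(s-1)}\dbinom{s}{e}\dbinom{n-s}{n-l-e} \Big), & e \ge 2, \\
        \dfrac{n\gamma}{\log\alpha}\Big( \dfrac{n}{n-1} 1\{l+1=n\}
          + \dbinom{n-1}{l} \sum_{i=2}^{n-1} \dfrac{1}{i}
          + \sum_{s=2}^{n-1} \dfrac{1}{s-1}\dbinom{n-s}{n-l-1} \Big), & e = 1, \\
        \dbinom{n}{l}\Big( 1 - \dfrac{n\gamma}{\log\alpha} \sum_{i=1}^{n-1} \dfrac{1}{i} \Big)
          + \dfrac{n\gamma}{\log\alpha}\Big( \dfrac{1}{n-1} 1\{l=n\}
          + \dbinom{n-1}{n-l} \sum_{i=2}^{n-1} \dfrac{1}{i}
          + \sum_{s=2}^{n-1} \dfrac{1}{s(s-1)}\dbinom{n-s}{n-l} \Big), & e = 0.
      \end{cases}$

Here a binomial coefficient $\binom{a}{b}$ with b < 0 or b > a is read as 0.
Necessarily given L = l and E = e, the number of lineages going back to the
founder of the sweep is n - l - e.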

This corollary will be proved in Section 4.5. 
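
The combinatorial factor that multiplies $E[p_F^{n-l}(1-p_F)^l]$ in the sampling formula can be checked numerically from Theorem 1 alone: conditioning on S and using (2.6), it equals $\sum_s P[S=s]\binom{s}{e}\binom{n-s}{n-l-e}$. The sketch below computes this sum; the function names are hypothetical, (2.5) is read as in the previous sketch of Theorem 1's ingredients with P[S = 0] as the remaining mass, and $p_F$ itself is not evaluated.

```python
from math import comb, log


def pmf_S(n, gamma, alpha):
    """[P[S = 0], ..., P[S = n]] following (2.5); P[S = 0] is the remaining mass."""
    c = gamma * n / log(alpha)
    pmf = [0.0] * (n + 1)
    if n >= 2:
        pmf[1] = c * sum(1.0 / i for i in range(2, n))
        for s in range(2, n):
            pmf[s] = c / (s * (s - 1))
        pmf[n] = c / (n - 1)
    pmf[0] = 1.0 - sum(pmf[1:])
    return pmf


def binom(a, b):
    """Binomial coefficient, read as 0 outside the range 0 <= b <= a."""
    return comb(a, b) if 0 <= b <= a else 0


def partition_factor(e, l, n, gamma, alpha):
    """sum_s P[S = s] * C(s, e) * C(n - s, n - l - e)."""
    pmf = pmf_S(n, gamma, alpha)
    return sum(pmf[s] * binom(s, e) * binom(n - s, n - l - e) for s in range(n + 1))


if __name__ == "__main__":
    n, gamma, alpha = 5, 0.5, 1e4          # hypothetical parameter values
    for l in range(n + 1):
        print(l, [round(partition_factor(e, l, n, gamma, alpha), 4) for e in range(n + 1)])
```

Summing the factor over e for fixed l returns $\binom{n}{l}$ (Vandermonde's identity), so that the joint weights add up to the binomial law of L given F, as they should.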

2.3. Comparison with Schweinsberg and Durrett's work. Our research 
has been substantially inspired by recent work of Schweinsberg and Durrett 
[4, 21]. Let us point out briefly how the main results of [21] and of the present 
paper complement each other. 

Schweinsberg and Durrett [21] considered a two-locus Moran model with
population size 2N, selective advantage s of the advantageous allele and
individual recombination probability $r = \mathcal{O}(1/\log N)$. Their main result is
(in our terminology) about the approximate distribution of the ancestral
distribution of an n-sample at the neutral locus as $N \to \infty$. In preparing
their Theorem 1.2, Schweinsberg and Durrett [21] specified (in terms of a
stick-breaking scheme made up by a sequence of Beta variables) a random
paintbox with parameters $L = \lfloor 2Ns \rfloor$ and r/s. They denoted the (labeled)
distribution of an n-sample drawn from the paintbox (where the class
belonging to the first draw is tagged) by $Q_{r/s,L}$. The assertion of their
Theorem 1.2 then is that $Q_{r/s,L}$ approximates the ancestral sample distribution
at the neutral locus with probability $1 - \mathcal{O}(1/(\log N)^2)$. Notably, s remains
fixed, that is, does not scale with N.

A priori, this strong selection limit does not lend itself to a diffusion
approximation. However, interestingly enough, certain aspects do: in
particular, the ones studied in the present context. More precisely, our results
show that the approximate distribution of the ancestral sample partition in
the strong selection limit of [21] arises also in a two-stage way, first passing
to the diffusion limit and then letting the selection coefficient tend to
infinity. This is made precise in the following proposition, which will be proved
at the end of Section 4.4.

Proposition 2.8. Let $Q_{r/s,L}$ be as in [21], Theorem 1.2. Then, with
the choice

(2.8) $\alpha = 2Ns, \qquad \rho = 2Nr, \qquad \gamma = \frac{r}{s}\log\alpha,$

the distribution specified in Theorem 1 and further described in Corollary
2.7 approximates $Q_{r/s,L}$ up to an error of $\mathcal{O}(1/(\log N)^2)$.

Hence our results (Theorem 1 and Corollary 2.7) also give an approximate
sampling formula for the random partition that appears in Theorem 1.2 of
[21] and enters as an input to the coalescent with simultaneous multiple
mergers described in [5]. In particular, our results reveal that this random
partition has more than one nonsingleton class with a probability of only
$\mathcal{O}(1/(\log N)^2)$, a result which is less explicit in the proofs of [21].

Let us emphasize once again that in [21] the error is controlled in a specific
"large but finite population" model, whereas in our approach the error is
controlled after having performed a diffusion approximation. Proposition 2.8
together with Theorem 1.2 of [21] reveals that our diffusion approximation
has the order of approximation $\mathcal{O}(1/(\log N)^2)$ in the strong selection limit of
the Moran model. This might be seen as one more indication for the strength
and robustness of the diffusion approximation in the context of population
genetics.

Numerical results. One can now still ask how large are the constants
which are lurking behind the $\mathcal{O}$'s. To shed some light on this and to see how
well our approximations perform, let us present some numerics. We compare
the approximation of Theorem 1 with numerical examples given in [21]. The
examples deal with samples of size n = 1 and n = 2. We distinguish the
number and types of ancestors of the sample at the beginning of the sweep.
For a single individual the probability that the ancestor is of type b (an
event called pinb in [21]) can be approximated by

pinb $\approx$ P[L = 1],

as there is no early phase in our theorem in this case. For a sample of size 2,
either there are two ancestors and both have type b (denoted p2inb), there
are two ancestors, one of type B and one of type b (denoted p1B1b), or there
is one ancestor with either a b allele (denoted p2cinb) or a B allele (which
happens in all other cases). Using Theorem 1 we approximate

p2inb $\approx$ P[L = 2 or S = 2, L = 1],
p2cinb $\approx$ P[L = 0, S = 2],
p1B1b $\approx$ P[L = 1, S = 0].

In [21] simulations were performed for three models: (i) in a Moran model,
(ii) in a model where the frequency of the B allele follows a deterministic
logistic growth curve, and (iii) for the approximate result obtained in
[21], Theorem 1.2. Results for a more extensive range of parameters can be
found in [4]. In Table 1, we have added the approximations of our
Theorem 1 to those of [21]. In all cases we find that the approximation given by
our Theorem 1 performs comparably to the approximation of Schweinsberg
and Durrett. Both approximations are significantly better than the logistic
model.
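
To see which diffusion parameters correspond to such examples, the dictionary (2.8) can be evaluated directly. The sketch below uses the values N = 10^4, s = 0.1 and r = 0.001064 as read from the first block of Table 1 (taking the tabulated size to be the Moran parameter N is an assumption); the function name is hypothetical. It only computes $\alpha$, $\rho$, $\gamma$, the crude first-order probability $1 - e^{-\gamma}$ from (1.3) and the leading-order sweep duration, and it does not reproduce the refined values of Theorem 1.

```python
import math


def sweep_parameters(N, s, r):
    """Translate Moran-model parameters into (alpha, rho, gamma) via (2.8)."""
    alpha = 2 * N * s
    rho = 2 * N * r
    gamma = (r / s) * math.log(alpha)
    return alpha, rho, gamma


if __name__ == "__main__":
    alpha, rho, gamma = sweep_parameters(N=10**4, s=0.1, r=0.001064)
    first_order_recombinant = 1.0 - math.exp(-gamma)   # 1 - p with p from (1.3)
    duration = 2.0 * math.log(alpha) / alpha           # in units of 2N generations
    print(alpha, rho, gamma, first_order_recombinant, duration)
```

With these inputs $\alpha$ is 2000 and $\gamma$ is roughly 0.08, so $\log\alpha$ is only about 7.6; this is a reminder that error terms of order $(\log\alpha)^{-2}$ need not be numerically tiny for realistic parameters.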

3. Outline of the proof of Theorem 1. We start by calculating the
expected duration of the sweep.

Table 1

Numerical results comparing Theorem 1 with Theorem 2 from [21] and a logistic model.
The numbers in brackets are relative errors with respect to the Moran model

               pinb              p2inb             p2cinb             p1B1b

N = 10^4, s = 0.1, r = 0.001064
Moran          0.08203           0.00620           0.01826            0.11513
Logistic       0.09983 (21%)     0.00845 (36%)     0.03365 (84%)      0.11544 (0.3%)
DS, Thm. 2     0.08235 (0.4%)    0.00627 (1.1%)    0.01765 (-3.4%)    0.11687 (1.5%)
Thm. 1         0.08249 (0.6%)    0.00659 (6.3%)    0.01867 (2.2%)     0.11515 (0.0%)

N = 10^4, s = 0.1, r = 0.005158
Moran          0.33656           0.10567           0.05488            0.35201
Logistic       0.39936 (18%)     0.13814 (31%)     0.09599 (75%)      0.32646 (-7.3%)
DS, Thm. 2     0.34065 (1.2%)    0.10911 (3.2%)    0.05100 (-7.1%)    0.36112 (2.6%)
Thm. 1         0.32973 (-2.0%)   0.10857 (2.7%)    0.05662 (3.2%)     0.34157 (-3.0%)


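3.1. The duration of the sweep. Let $T_\delta$ be the time at which X reaches
the level $\delta$ for the first time.

Lemma 3.1. For all fixed $\varepsilon \in (0,1)$, as $\alpha \to \infty$,

(3.1) $E[T_\varepsilon] = \frac{\log\alpha}{\alpha} + \mathcal{O}\big(\frac{1}{\alpha}\big), \qquad E[T - T_{1-\varepsilon}] = \frac{\log\alpha}{\alpha} + \mathcal{O}\big(\frac{1}{\alpha}\big),$

(3.2) $E[T_{1-\varepsilon} - T_\varepsilon] = \mathcal{O}\big(\frac{1}{\alpha}\big),$

(3.3) $\mathrm{Var}[T] = \mathcal{O}\big(\frac{1}{\alpha^2}\big).$

This lemma will be proved in Section 4. Notice in particular that
$E[T] = 2\log\alpha/\alpha + \mathcal{O}(\alpha^{-1})$. Thus, to see a nontrivial number of
recombination events along a single line between t = 0 and t = T ($\beta = T$ and
$\beta = 0$), the recombination rate $\rho$ should be on the order of $\alpha/\log\alpha$.
Henceforth, we therefore assume that $\rho$ obeys equation (1.2).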

3.2. Three approximation steps. In a first approximation step we will
show that all events (both coalescences and recombinations) that happen
along the lineages of the structured coalescent while dwelling in state b have
a negligible effect on the sampling distribution. Thus once a lineage has
recombined away from state B, we can assume that it experiences no further
recombination or coalescence events in the remaining time to $\beta = T$. This
motivates us to couple the structured n-coalescent (at the neutral locus) with
the n-coalescent (at the selective locus), and to study the ancestral partition
at the neutral locus by marking effective recombination events that happen
along the selective lineages.

Definition 3.2. For a given sweep path X, consider the coalescent $\mathcal{T}_n$
in background X together with a Poisson process with intensity measure
$(1 - X_t)\rho\,dt$ along the lineages of $\mathcal{T}_n$. Say that two leaves of $\mathcal{T}_n$ belong to
the same family if and only if the path in $\mathcal{T}_n$ which connects them is not hit
by a mark. Call a mark early if it occurs between time 0 and the time when
the number of lines of $\mathcal{T}_n$ increases from n - 1 to n and call it late otherwise.
Label a family as early-marked (late-marked) if it traces back to an early
(late) mark; otherwise label it as nonmarked. In this way we arrive at what
we call the labeled partition $\mathcal{P}^{\mathcal{T}_n}$.

Note that the nonlabeled version of $\mathcal{P}^{\mathcal{T}_n}$ arises from the marked tree
$\mathcal{T}_n$ in the same way as the sample partition in the infinite-alleles model
emerges from the marked coalescent. Also note that late-marked families in
$\mathcal{P}^{\mathcal{T}_n}$ are necessarily singletons. It will turn out (see Corollary 3.5) that
$\mathcal{P}^{\mathcal{T}_n}$ approximates $\mathcal{P}^{\Xi_n}$ with probability $1 - \mathcal{O}((\log\alpha)^{-2})$.

The second approximation step consists of replacing $\mathcal{P}^{\mathcal{T}_n}$ by a labeled
partition generated by a marked Yule tree.

Definition 3.3. Let $\mathcal{Y}$ be an (infinite) Yule tree with branching rate
$\alpha$ and let $\mathcal{Y}_n$ be the random tree which arises by sampling n lineages from
$\mathcal{Y}$ (which come down from infinity). Up to the time when the number of
lines extant in $\mathcal{Y}$ reaches the number $\lfloor\alpha\rfloor$, mark the lines of $\mathcal{Y}_n$ by a Poisson
process with homogeneous intensity $\rho = \gamma\alpha/\log\alpha$. Families, early and late
marks, early-marked families and so forth are specified in complete analogy
to Definition 3.2. The resulting labeled partition of {1, ..., n} will be denoted
by $\mathcal{P}^{\mathcal{Y}_n}$.

Again, we will show (see Proposition 3.6) that $\mathcal{P}^{\mathcal{Y}_n}$ approximates $\mathcal{P}^{\mathcal{T}_n}$
with probability $1 - \mathcal{O}((\log\alpha)^{-2})$.

The third approximation step exploits the fact that the probability for
more than one early mark is $\mathcal{O}((\log\alpha)^{-2})$. The random variable F specified
in Theorem 1 stands for the number of lines extant in the full tree $\mathcal{Y}$ at the
time of the most recent coalescence in the sample tree $\mathcal{Y}_n$. The number $p_f$
is the approximate probability that, given F = f, a single lineage is not hit
by a late mark (or equivalently, does not experience a late recombination);
note that this probability is larger than the probability p given by (1.3).
The number of late-marked families (corresponding to late recombinants) is
approximated by a mixed Binomial random variable L with random success
probability $1 - p_F$. In the dominating case that $\mathcal{Y}_n$ in its early phase (when
it has less than n lines) is hit by at most one mark, the random variable S
approximates the size of the early-marked family which arises if we "cut off"
$\mathcal{Y}_n$ at the time of its most recent coalescence (i.e., when its number of lines
increases from n - 1 to n). This size is thinned out by late marks; in other
words, the final size of the early-marked family arises as a hypergeometric
random variable, by randomly distributing the n - L lineages which have
not been knocked out by late marks onto the S + (n - S) potential ancestors
at the most recent coalescence time of $\mathcal{Y}_n$.

3.3. First step: From the structured to a marked coalescent. The key
result for the first approximation step is the following:

Proposition 3.4. (i) The probability that a neutral ancestral lineage of
our sample recombines out of B and then recombines back into B and (ii)
the probability that a pair of neutral ancestral lineages coalesces in b are both
$\mathcal{O}(\frac{1}{(\log\alpha)^2})$.

The previous proposition allows us [within the accuracy of
$\mathcal{O}(\frac{1}{(\log\alpha)^2})$] to
dispense with the structured coalescent and work instead with the marked
coalescent in background X as described in Definition 3.2. Indeed, the
following statement is immediate from Proposition 3.4.

Corollary 3.5. The variation distance between the distributions of
$\mathcal{P}^{\Xi_n}$ and $\mathcal{P}^{\mathcal{T}_n}$ is $\mathcal{O}(\frac{1}{(\log\alpha)^2})$.

3.4. Second step: From the marked coalescent to a marked Yule tree. A
key tool will be a time transformation which takes the random sweep path
into a (stopped) supercritical Feller diffusion. Because the early phase of the
sweep is the most relevant one and because a Fisher-Wright diffusion
entering from zero looks similar to a Feller diffusion entering from zero as long
as both are close to zero, we will be able to control the error in the sample
partition that results from replacing X by the path of a Feller diffusion.
Under this time transformation the mark (recombination) rate will become
a constant. By exchangeability, we can (without loss of generality) sample
only from individuals in our Feller diffusion with infinite line of descent
(provided of course that there are enough such) and it is well known that such
"immortal particles" form a Yule tree with branching rate $\alpha$ [7]. This means
that we sample from a Poisson number of individuals with parameter $\alpha$, but
we shall see that it suffices to consider the Yule tree stopped when it has
precisely $\lfloor\alpha\rfloor$ extant individuals.

Proposition 3.6. Let $\mathcal{Y}$, $\mathcal{Y}_n$ and $\mathcal{P}^{\mathcal{Y}_n}$ be as in Definition 3.3. The
variation distance between the distributions of $\mathcal{P}^{\mathcal{T}_n}$ and $\mathcal{P}^{\mathcal{Y}_n}$ is
$\mathcal{O}(1/(\log\alpha)^2)$.

3.5. Third step: Approximating sample partitions in marked Yule trees.
Because of Corollary 3.5 and Proposition 3.6, the proof of Theorem 1 will
be complete if we can show that the representation given there applies,
within the accuracy of $\mathcal{O}(1/(\log\alpha)^2)$, to the random labeled partition $\mathcal{P}^{\mathcal{Y}_n}$.

Thus, the remaining part of the proof takes place in the world of marked Yule 
processes, where matters are greatly simplified and many exact calculations 
are possible. 

Let $I = I(t)$ be the number of lines of $\mathcal{Y}$ extant at time $t$ and let $K_i$ be
the number of lines extant in $\mathcal{Y}_n$ while $I = i$. The process $K = (K_i)$ will play
a major role in our analysis below. Viewing the index $i$ as time, referred to
below as Yule time, we will see that $K$ is a Markov chain.

We denote by $M_i$ the number of marks that hit $\mathcal{Y}_n$ while $\mathcal{Y}$ has $i$ lines.
Since the latter period is exponentially distributed with parameter $i\alpha$ and
marks appear along lines according to a Poisson process with rate $\gamma\alpha/\log\alpha$,
we arrive at the following observation:

Remark 3.7. Given $K_i = k$, $M_i$ is distributed as $G - 1$, where $G$ has a
geometric distribution with parameter

$\frac{i\alpha}{i\alpha + k\gamma\alpha/\log\alpha} = \frac{1}{1 + (k/i)\gamma/\log\alpha}.$

Consequently, the conditional expectation of $M_i$ given $K_i = k$ is $\frac{k}{i}\,\frac{\gamma}{\log\alpha}$.
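
As a numerical illustration of Remark 3.7, the following Python sketch computes the geometric parameter and the resulting mean, and compares the mean with a direct simulation of the race between the next mark and the end of Yule time $i$. The function names are ours and purely illustrative; `alpha` and `gamma` stand for the selection coefficient and the scaled recombination parameter, and the simulation assumes that the total mark rate on the $k$ sample lines is $k\gamma\alpha/\log\alpha$ and that Yule time $i$ ends at rate $i\alpha$.

```python
import math
import random


def mark_count_parameter(i, k, alpha, gamma):
    """Success parameter of G in Remark 3.7 (k sample lines, i lines in the full tree)."""
    mark_rate = k * gamma * alpha / math.log(alpha)  # total mark rate on the k sample lines
    split_rate = i * alpha                           # rate at which Yule time i ends
    return split_rate / (split_rate + mark_rate)


def expected_marks(i, k, alpha, gamma):
    """Mean of G - 1 for a geometric G on {1, 2, ...}, that is (1 - p) / p."""
    p = mark_count_parameter(i, k, alpha, gamma)
    return (1.0 - p) / p


def simulate_marks(i, k, alpha, gamma, rng):
    """Count marks until the next split: each event is a mark with probability 1 - p."""
    mark_rate = k * gamma * alpha / math.log(alpha)
    split_rate = i * alpha
    marks = 0
    while rng.random() < mark_rate / (mark_rate + split_rate):
        marks += 1
    return marks
```

For instance, `expected_marks(5, 3, 1000.0, 2.0)` returns $(3/5)\cdot 2/\log 1000$, and averaging `simulate_marks` over many runs with a seeded `random.Random` should land close to that value.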

We will distinguish two phases of the process $\mathcal{Y}$. The early phase will
consist of all Yule times $i$ when $K_i < n$ and the late phase will consist of the
Yule times $i$ with $K_i = n$.

We define

$F := \min\{i : K_i = n\},$

that is, $F$ is the number of lines in the full tree $\mathcal{Y}$ when the number of lines
in the sample tree $\mathcal{Y}_n$ reaches its final size $n$. This is when the late phase
begins.

Proposition 3.8. The distribution of F is given by (2.3). 

By analogy with Definition 3.2, we call those marks which hit y n in the 
early (late) phase the early (late) marks. 

The labeled partition $\mathcal{P}^{\mathcal{Y}_n}$ introduced in Definition 3.3 is generated by
early and late marks. Let us treat the early marks first. We will find in
Proposition 3.9 that up to our desired accuracy, there is at most one early
mark.

We write

$M = \sum_{i:\, i<F} M_i$

for the number of early marks. Let us denote by $S^{\mathcal{Y}}$ the number of leaves in
$\mathcal{Y}_n$ whose ancestral lineage is hit by an early mark. On the event $\{M = 1\}$,
that is, in the case of a single early mark, the leaves of $\mathcal{Y}_n$ are partitioned
into two classes, one (of size $S^{\mathcal{Y}}$) whose ancestry is hit by this single early
mark and one whose ancestry is not hit by an early mark.

The next proposition gives an approximation for the joint distribution of
$(M, S^{\mathcal{Y}})$.


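Proposition 3.9. Up to an error of $\mathcal{O}(1/(\log\alpha)^2)$,

(3.4)  $P[M = 1, S^{\mathcal{Y}} = s] = \begin{cases} \dfrac{n\gamma}{\log\alpha}\displaystyle\sum_{k=2}^{n-1}\frac{1}{k}, & s = 1,\\[2mm] \dfrac{n\gamma}{\log\alpha}\,\dfrac{1}{s(s-1)}, & 2 \le s \le n-1,\\[2mm] \dfrac{n\gamma}{\log\alpha}\,\dfrac{1}{n-1}, & s = n. \end{cases}$

Furthermore,

(3.5)  $P[M \ge 2] = \mathcal{O}\Bigl(\frac{1}{(\log\alpha)^2}\Bigr).$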
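
The leading-order terms of (3.4) are easy to tabulate. The sketch below returns, for each $s$, the coefficient $c_s$ such that $P[M = 1, S^{\mathcal{Y}} = s]$ is approximately $c_s\,\gamma/\log\alpha$, reading the three cases of (3.4) as $n\sum_{k=2}^{n-1} 1/k$ for $s = 1$, $n/(s(s-1))$ for $2 \le s \le n-1$ and $n/(n-1)$ for $s = n$. The function name is ours; the code assumes $n \ge 2$ and uses exact rational arithmetic. As a consistency check, the coefficients add up to $n(1 + 1/2 + \cdots + 1/(n-1))$.

```python
from fractions import Fraction


def early_mark_coefficients(n):
    """Coefficients c_s with P[M = 1, S = s] ~ c_s * gamma / log(alpha), read off (3.4); n >= 2."""
    coeff = {}
    for s in range(1, n + 1):
        if s == n:
            coeff[s] = Fraction(n, n - 1)
        elif s == 1:
            # empty sum for n = 2, so the coefficient is 0 in that case
            coeff[s] = n * sum((Fraction(1, k) for k in range(2, n)), Fraction(0))
        else:
            coeff[s] = Fraction(n, s * (s - 1))
    return coeff
```

For $n = 3$ all three coefficients equal $3/2$.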


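For fixed $f \le \lfloor\alpha\rfloor$, the probability that a randomly chosen line is not hit
by a mark between $i = f$ and $i = \lfloor\alpha\rfloor$ is (cf. Remark 3.7)

$\prod_{i=f}^{\lfloor\alpha\rfloor} \frac{1}{1 + (1/i)\gamma/\log\alpha} = \exp\Bigl(\sum_{i=f}^{\lfloor\alpha\rfloor} \log\Bigl(1 - \frac{\gamma}{\log\alpha}\,\frac{1}{i + \gamma/\log\alpha}\Bigr)\Bigr)$

(3.6)  $= \exp\Bigl(-\frac{\gamma}{\log\alpha}\sum_{i=f}^{\lfloor\alpha\rfloor} \frac{1}{i + \gamma/\log\alpha}\Bigr) + \mathcal{O}\Bigl(\frac{1}{(\log\alpha)^2}\Bigr)$

$= p_f + \mathcal{O}\Bigl(\frac{1}{(\log\alpha)^2}\Bigr),$

where $p_f$ was defined in (2.4). The last step follows from the Taylor expansion

$\frac{1}{i + \gamma/\log\alpha} = \frac{1}{i} + \frac{1}{i^2}\,\mathcal{O}\Bigl(\frac{1}{\log\alpha}\Bigr).$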
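
The size of the error in (3.6) can be inspected numerically. The sketch below compares the exact product over Yule times with the exponential of a harmonic sum. It assumes, as (3.6) suggests, that $p_f$ equals $\exp(-\frac{\gamma}{\log\alpha}\sum_{i=f}^{\lfloor\alpha\rfloor} 1/i)$; the definition (2.4) is not restated in this section, so this reading, the function names and the parameter values are our assumptions.

```python
import math


def prob_not_hit(f, alpha, gamma):
    """Exact product over i = f..floor(alpha) of 1 / (1 + (1/i) * gamma / log(alpha))."""
    x = gamma / math.log(alpha)
    top = math.floor(alpha)
    return math.exp(-math.fsum(math.log1p(x / i) for i in range(f, top + 1)))


def harmonic_approximation(f, alpha, gamma):
    """exp(-(gamma / log(alpha)) * sum_{i=f}^{floor(alpha)} 1/i), our reading of p_f."""
    x = gamma / math.log(alpha)
    top = math.floor(alpha)
    return math.exp(-x * math.fsum(1.0 / i for i in range(f, top + 1)))
```

Since $\log(1+y) \le y$, the exact product is never smaller than the approximation, and the gap is of order $(\gamma/\log\alpha)^2$.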


Write $L^{\mathcal{Y}}$ for the number of lineages in $\mathcal{Y}_n$ that are hit by late marks.
For distinct lineages of $\mathcal{Y}_n$, the events that they are hit by late marks are
asymptotically independent, which allows us to approximate the distribution
of $L^{\mathcal{Y}}$.

Proposition 3.10. The distribution of $L^{\mathcal{Y}}$ is approximately mixed Binomial.
More precisely,

(3.7)  $P[L^{\mathcal{Y}} = l] = \sum_{f=n}^{\infty} \binom{n}{l}(1-p_f)^l\,p_f^{\,n-l}\,P[F = f] + \mathcal{O}\Bigl(\frac{1}{(\log\alpha)^2}\Bigr)$

$= \binom{n}{l} E\bigl[p_F^{\,n-l}(1-p_F)^l\bigr] + \mathcal{O}\Bigl(\frac{1}{(\log\alpha)^2}\Bigr), \qquad l = 0,\ldots,n.$

Based on the previous two propositions, we will be able to show that up
to our desired accuracy the random variables $S^{\mathcal{Y}}$ and $L^{\mathcal{Y}}$ can be treated as
independent.

Proposition 3.11. The random variables $S^{\mathcal{Y}}$ and $L^{\mathcal{Y}}$ are approximately
independent, that is,

$P[S^{\mathcal{Y}} = s, L^{\mathcal{Y}} = l] = P[S^{\mathcal{Y}} = s]\cdot P[L^{\mathcal{Y}} = l] + \mathcal{O}\Bigl(\frac{1}{(\log\alpha)^2}\Bigr).$

Given $M = 1$, $S^{\mathcal{Y}} = s$ and $L^{\mathcal{Y}} = l$, the size (call it $E^{\mathcal{Y}}$) of the (single)
early-marked family in $\mathcal{P}^{\mathcal{Y}_n}$ (see Definition 3.3) is hypergeometric, choosing
$n - l$ out of two classes, one of size $s$, the other of size $n - s$. Hence from
Propositions 3.9, 3.10 and 3.11 the labeled partition $\mathcal{P}^{\mathcal{Y}_n}$ consists, with
probability $1 - \mathcal{O}((\log\alpha)^{-2})$, of $L$ late-marked singletons, one early-marked
family of size $E$ and one nonmarked family of size $n - L - E$, where the
joint distribution of $(L, E)$ is specified in Theorem 1. Combining this with
Proposition 3.6 and Corollary 3.5, the proof of Theorem 1 is complete.
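
The hypergeometric thinning step can be written out explicitly. The following sketch gives the conditional law of the size of the early-marked family given $M = 1$, $S^{\mathcal{Y}} = s$ and $L^{\mathcal{Y}} = l$, namely the number of marked items among $n - l$ draws without replacement from $s$ marked and $n - s$ unmarked ones. The function name is ours, and exact rational arithmetic is used.

```python
from fractions import Fraction
from math import comb


def early_family_size_pmf(n, s, l):
    """P[E = e | M = 1, S = s, L = l]: n - l draws from s early-marked and n - s other leaves."""
    draws = n - l
    total = comb(n, draws)
    lowest = max(0, draws - (n - s))   # cannot draw more unmarked leaves than exist
    highest = min(s, draws)
    return {e: Fraction(comb(s, e) * comb(n - s, draws - e), total)
            for e in range(lowest, highest + 1)}
```

For example, with $n = 5$, $s = 2$ and $l = 1$ the early-marked family has size 1 with probability $2/5$ and size 2 with probability $3/5$.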

4. Proofs. 


4.1. The duration of the sweep: Proof of Lemma 3.1. We use standard
theory about one-dimensional diffusions. (See, e.g., [8] and [16].) Primarily
we need the Green's function $G(\cdot,\cdot)$ that corresponds to the solution of the
SDE (2.1), this time with $X_0 = x \in [0,1]$.

The Green's function satisfies

(4.1)  $E_x\Bigl[\int_0^T g(X_s)\,ds\Bigr] = \int_0^1 G(x,\xi)\,g(\xi)\,d\xi,$

where $E_x[\cdot]$ is the expectation with respect to the process $X$ started in $x$
(and $E[\cdot] = E_0[\cdot]$).

If $X$ is a solution of (2.1), then (see [16], Chapter 15, formula (9.8))

(4.2)  $x \le \xi:\quad G(x,\xi) = \frac{(1 - e^{-\alpha(1-\xi)})(1 - e^{-\alpha\xi})}{\alpha\,\xi(1-\xi)(1 - e^{-\alpha})},$

       $x \ge \xi:\quad G(x,\xi) = \frac{(e^{-\alpha x} - e^{-\alpha})(e^{\alpha\xi} - 1)(1 - e^{-\alpha\xi})}{\alpha\,\xi(1-\xi)(1 - e^{-\alpha})(1 - e^{-\alpha x})}.$

Observe that $G(x,\xi)$ is decreasing in $x$.
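
To get a feeling for the orders of magnitude, one may integrate the Green's function numerically. The sketch below takes $g \equiv 1$ and $x = 0$ in (4.1), so that $E[T] = \int_0^1 G(0,\xi)\,d\xi$, and uses the first case of (4.2) in the form $G(0,\xi) = (1-e^{-\alpha(1-\xi)})(1-e^{-\alpha\xi})/(\alpha\xi(1-\xi)(1-e^{-\alpha}))$. The function names, the midpoint rule and the step count are our choices, not part of the paper; the comparison with $2\log\alpha/\alpha$ reflects our reading of Lemma 3.1, which is not restated in this section.

```python
import math


def green_from_zero(xi, alpha):
    """G(0, xi), first case of (4.2) as read above."""
    num = (1.0 - math.exp(-alpha * (1.0 - xi))) * (1.0 - math.exp(-alpha * xi))
    den = alpha * xi * (1.0 - xi) * (1.0 - math.exp(-alpha))
    return num / den


def expected_sweep_duration(alpha, steps=200_000):
    """Midpoint rule for E[T] = int_0^1 G(0, xi) d xi, i.e. (4.1) with g = 1 and x = 0."""
    h = 1.0 / steps
    return h * math.fsum(green_from_zero((j + 0.5) * h, alpha) for j in range(steps))
```

For $\alpha = 1000$ the quantity $\alpha E[T] - 2\log\alpha$ stays bounded (it is a little above 1), in line with $E[T] = 2\log\alpha/\alpha + \mathcal{O}(1/\alpha)$.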


Proof of (3.1) and (3.2). In the proofs there will appear some constants
$C$, $C'$ which might change from occurrence to occurrence.

Observe that

(4.3)  $E[T_\varepsilon] = E[T] - E_\varepsilon[T] = \int_0^\varepsilon G(0,\xi)\,d\xi - \int_0^\varepsilon G(\varepsilon,\xi)\,d\xi,$

where we have used that $G(0,\xi) = G(\varepsilon,\xi)$ as long as $\xi \ge \varepsilon$. Since

$\frac{1}{\xi(1-\xi)} = \frac{1}{\xi} + \frac{1}{1-\xi}$

and using the symmetry $G(0,\xi) = G(0,1-\xi)$, we see that for the first
term in (4.3),


G(0,0de- 


loga 1 / [ £ f 1 (1 — e “^)(1 —e ^) 


a 


+ 

a V4o Ji-e 


= -(C + 
a 


= -lC- 
a 


(l-e- a )C 

“ £ (l-e“ ? )(l-e" Q+? ) 

£ 

ae g-? _|_ g-"+? — P ~ a 


d£ — log a 

r a£ 1 

dc ~L ^ 

i 




For the second term, as 1 — e < 1 — e for £ < e, 


/*£ g /* 

/ G(e,Z)dt<- r —-r 
Jo all — e) Jo 


£ e a ^ — 1 e" 

di = 


< 


rae e € _ 1 

a(l -e) Jo £ a(l — e) Jo £ 

Ce~ a£ 


d£ 


a 


1 + 




A similar calculation leads to the second statement of (3.1), and from these
two equalities and as $E[T] = E[T_{1/2}] + E[T - T_{1/2}]$, also (3.2) follows. □


Proof of (3.3). To compute the variance of $T$ we use the following
identity, which is a consequence of the Markov property (and can be checked
by induction on $k$):

(4.4)  $E_x\Bigl[\int_0^T\int_{t_1}^T\cdots\int_{t_{k-1}}^T g_k(X_{t_k})\cdots g_1(X_{t_1})\,dt_k\cdots dt_1\Bigr]$

       $= \int_0^1\cdots\int_0^1 G(x,x_1)\cdots G(x_{k-1},x_k)\,g_1(x_1)\cdots g_k(x_k)\,dx_k\cdots dx_1.$

obtain 

Var[T] = 2 I" [ 1 G(0,OG(t,r,)dr ] dt-2 f 1 ['G^OGMdydt; 

Jo Jo Jo J f 


/ 0 Jo 

From this we obtain 

f*i r i 


= 2 


f'f 

lo Jo 
2 


G(0,S)Gfori)dTidZ 


a 2 (l — e a ) 2 

f 1 ft (1 - e - Q M)( e -a£ -e~ a ) {e ari - 1)(1 -e~ aT1 ) 


/0 Jo 


2e~ a 


€(i — 0 viX-v) 

r-l rl-f g«€ _l e <*V_l 


dgd £ 


a 2 (l — e a ) 2 Jo Jo £(1 -£) Vi 1 ~ V) 
2 


drjd £ 


a 2 ( 1 — e a ) 2 


-a/2 


1/2 e a? _ j ^2 

dt, 


+ 2 


io e(i - o 

rl/2 r£ e -o? _ g-a e a?7 _ j 

£(i-£) vX~v) 


dgd^ 


Jo Jo 

by a decomposition of the area $\{(\xi,\eta): \eta \le 1-\xi\}$ in $\{(\xi,\eta): \xi,\eta \le 1/2\}$,
$\{(\xi,\eta): \xi \le 1/2 \le \eta \le 1-\xi\}$ and $\{(\xi,\eta): \xi \ge 1/2,\ \eta \le 1-\xi\}$, and the
symmetry of the integrand. From this we see


Var[T]<— [C + 


of 


r-l/2 /■£ g-o^ _ g-a gar; _ ^ 


10 JO 


and (3.3) follows as 

rl/2 r£ g-«? _ g-“ g«»7 _ \ 


Jo Jo 


£ 




£ 

dgd^ 


dgd^ 


^ I 10 / 2 e ^ — e “ e 71 — 1 


V^avJo £ 7o V 

ra/2 g-f _ g-a / /T gT - 1 


dgd£ 


< 


£ 


'1 ?? 


dg + cjd£ + C' 


<C j* ^d£ + C' = 0{\). 
Here we have used $\int_0^{\alpha/2}\int_0^{\xi} d\eta\,d\xi = \int_0^1\int_0^{\xi} d\eta\,d\xi + \int_1^{\alpha/2}\bigl(\int_0^1 d\eta + \int_1^{\xi} d\eta\bigr)\,d\xi$ to
obtain the penultimate and $\int_1^{\xi} d\eta = \int_1^{\xi/2} d\eta + \int_{\xi/2}^{\xi} d\eta$ to obtain the last
inequality. □

4.2. From the structured to a marked coalescent. 


Proof of Proposition 3.4. (i) Given $(X_t)_{0\le t\le T}$, we are looking for
the probability that tracing backward from time $T$ to time 0 a lineage escapes
the sweep (which happens with rate $\rho(1 - X_t)$) and then recombines back
into $B$ (which happens then with rate $\rho X_t$). The required probability then
follows by integrating over path space and is given by

(4.5)  $E\Bigl[\int_0^T \Bigl(1 - \exp\Bigl(-\int_0^t \rho X_s\,ds\Bigr)\Bigr)\,\rho(1 - X_t)\,\exp\Bigl(-\int_t^T \rho(1 - X_s)\,ds\Bigr)\,dt\Bigr]$

       $\le \rho^2\,E\Bigl[\int_0^T (1 - X_t)\int_0^t X_s\,ds\,dt\Bigr]$

       $\le \rho^2\int_0^1\int_0^{\xi} G(0,\xi)\,G(\xi,\eta)\,d\eta\,d\xi + \rho^2\int_0^1\int_{\xi}^1 G(0,\xi)\,G(0,\eta)\,(1-\eta)\,\xi\,d\eta\,d\xi,$

where we have used that $G(\xi,\eta) = G(0,\eta)$ for $\xi \le \eta$ and (4.4). The first term
is bounded by $\rho^2\operatorname{Var}[T] = \mathcal{O}(1/(\log\alpha)^2)$ by (3.3). The second term gives

p 2 r 1 r 1 (l-e _Q? )(l-e _Q(1_ ^) 

a 2 (l — e~ a ) 2 Jo -4 1 — t 




V 


< 


P 


~ 


1 


a 2 Jo Jo (1 - O n 

log(l - rj) 


dt dr] = 


1 r 1 li 


4/ 

or Jo 


V 


dp = 0 


a 2 Jo Jl-r] t T] 


1 


dt, dp 


(logo) 5 


and we are done. 

(ii) To prove the second assertion of the proposition, we split the event
that two lineages coalesce in $b$ into two events. Recall that $T_\delta$ denotes the
time when $X$ first hits $\delta$. Whenever two lineages coalesce in $b$, then either
there must have been two recombination events between $T_{1/2}$ and $T$ or there
must have been a coalescence in $b$ between 0 and $T_{1/2}$. Both events only have
small probabilities as we now show.

First consider the event that two recombinations occur in $[T_{1/2}, T]$. We
see here as in (4.5) that the probability for this event is at most


p 2 e 1/2 


[ T [\l-X t ){l-X s )dsdt 
Jo Jo 

P 2 / ' f 1 G(lOG($,r,)(l-r,)(l-$)d7,dt 
Jo Jo 


<p 


2 f ' f ^G(0,OG^,v)drjd^ 


+ P 


+ P 


o Jo 

2 


1/2 ,1 


/0 


-0( l ~v)dvd£, 


cl cl 


1 1/2 Jl/2 




2 

where we have used that $G(x,\xi)$ is decreasing in $x$. The first term is at most
$\rho^2\operatorname{Var}[T] = \mathcal{O}(1/(\log\alpha)^2)$. The second is bounded by


j2 ,1/2 cl e -a/2( e a£ _ 1) 1 _ e ~ar, 


oi 2 Jo 


drjdE, 


Ji £ V 

o;/2 pcx, ya/2A?7 g£ _ ^ — g T) 


< 


(log a) 

y2g— a /2 

(log a) 2 


JO ^0 


c + 


£ 

,r(Ao/2 


?7 


P 

dr ]) = 0 


di dr] 


(log at)‘ 


where we have split the integral /“ drj in /, dr] + /“^ 2 dr] to obtain the final 
estimate. The third term is small, as it is the square of 

pf 1 G(0,O(l — I" -di = o(-^—). 

J 1/2 loga7i/2 4 V log ol ) 

The second event we have to consider is coalescence in $b$ between time 0 and
$T_{1/2}$. The probability for this event is, again as in (4.5), at most

rV 2 2 

/ G(0,0—de 


E 


' 1/2 


uo 


1-X 


dt 


< E 

2 


LJO 


1-X 
1/2 1 


l(X<r ]dt 


l-£ 


_ 8 W 2 j _ e -g 

~ a Jo £(1 —£) 2 ~aJo £ 


<V + l« g a)=o(^). 

a \ a J 

So both events are improbable and Proposition 3.4(ii) is proved. □

4.3. From the coalescent to the Yule tree.

Proof of Proposition 3.6. We have to show that the error we make
when using Yule trees instead of coalescent trees in a random background is
small. This involves two approximation steps. The first is an approximation
of the coalescent in the random background of a Wright-Fisher diffusion by
a coalescent in an $\alpha$-supercritical Feller background. The second step is an
approximation of the latter coalescent by Yule trees.

For the first approximation step we need to time-change our coalescent.
This relies on the following proposition, whose proof consists of an application
of [6], Chapter 6, Section 1.

Proposition 4.1. Under the random time change $t \mapsto \tau$ given by
$d\tau = (1 - X_t)\,dt$, the random path $X = (X_t)_{0\le t\le T}$ is taken into a random path
$Z = (Z_\tau)_{0\le\tau\le\tilde T}$, which is an $\alpha$-supercritical Feller diffusion governed by

$dZ = \sqrt{2Z}\,dW + \alpha Z\,d\tau,$

starting in $Z_0 = 0$, conditioned on nonextinction and stopped at the time $\tilde T$
when it first hits 1.

Under the time change $t \mapsto \tau$, the $n$-coalescent $\mathcal{T}_n$ described in Definition 2.2
is taken into the $n$-coalescent $\mathcal{C}_n$ whose pair coalescence rate conditioned
on $Z$ is $\frac{2}{Z_\tau(1 - Z_\tau)}\,d\tau$. Under this time change, the marking rate
$\rho(1 - X_t)\,dt$ becomes $\rho\,d\tau$. Let $\mathcal{P}^{\mathcal{C}_n}$ be the sample partition generated along
$\mathcal{C}_n$ in the same way as $\mathcal{P}^{\mathcal{T}_n}$ was generated along $\mathcal{T}_n$, but now with the uniform
marking rate $\rho\,d\tau$. Note that $\mathcal{P}^{\mathcal{C}_n}$ and $\mathcal{P}^{\mathcal{T}_n}$ have the same distribution.

Let us denote by $\mathcal{D}_n$ the $n$-coalescent whose pair coalescence rate conditioned
on $Z$ is $2/Z_\tau\,d\tau$, $\tilde T \ge \tau \ge 0$.

Proposition 4.2. The labeled sample partitions $\mathcal{P}^{\mathcal{C}_n}$ and $\mathcal{P}^{\mathcal{D}_n}$ generated
by a marking with rate $\rho\,d\tau$ along $\mathcal{C}_n$ and $\mathcal{D}_n$, respectively, coincide with
probability $1 - \mathcal{O}(1/(\log\alpha)^2)$.


We need a lemma for the proof of this proposition. Denote by $T^Z_\varepsilon$ the
time when $Z$ hits level $\varepsilon$ for the first time and denote by $T^{\mathcal{C}_n}$ the time when
the number of lines in $\mathcal{C}_n$ increases from $n - 1$ to $n$.

Lemma 4.3. Assume $\alpha$ is large enough such that $(\log\alpha)^2 \ge 2$. Let
$\varepsilon(\alpha) = (\log\alpha)^2/\alpha$. Then $P_0[T^Z_{\varepsilon(\alpha)} < T^{\mathcal{C}_n}] = \mathcal{O}(1/(\log\alpha)^2)$.

Proof. Our proof rests on a Green's function calculation analogous
to those in the proof of Proposition 3.4 and so, since we already have an
expression for the Green's function for the process $X$, it is convenient to
“undo” our time change. If we denote by $T^X_\varepsilon$ the time when $X$ hits level
$\varepsilon$ for the first time and denote by $T^{\mathcal{C}_n}$ (resp. $T^{\mathcal{T}_n}$) the time of the first
coalescence in $\mathcal{C}_n$ (resp. $\mathcal{T}_n$), then

$P_0[T^Z_{\varepsilon(\alpha)} < T^{\mathcal{C}_n}] = P_0[T^X_{\varepsilon(\alpha)} < T^{\mathcal{T}_n}].$

It is enough to consider the coalescence time of a sample of size 2, because
the probability that any pair in the sample coalesces is bounded by the
sum over all pairs of lineages.

With (4.2), 


PoK X (a) <T^] = l-V 


e{a) 


exp 


f! 


< E e(a) 


TfT T d ‘ 

0 A s 


= / G(e(a),f)?d£ 
Jo c 


£ 

Define $g(\alpha) := (\log\alpha)^2$. We split the last integral into three parts. We have,
for constants $C$ which change from line to line,


M<x) 2 

/ q G{e(aU)-d£ = 2 


< 


fe(a) ( e -ff(a) _ e~ a )(e a € - 1)(1 - e-°*) 
o a£ 2 (l — £)(1 — e - 9( Q ))(l — e~ a ) 


di 


2 e -9(“) 


M a ) (e ? - 1)(1 -e -5 ) 


(l_e-ff(a))(l _ e ( a )) i / 0 £ 2 

/ , , / w„ [9{a)/2 rg(a) l \ 

< C (e~ 9(a) + / di + / —,dA 

V J 1 Jg(a )/2 ) 


d£ 


< 


C 

9(a) ’ 


r G(e(aU)U = * A (1 7 « 

4(a) '£ J e{ol) a£ 2 (l-£)(l~e “) 

/■“/2 1 4 

<4 / — < 


9(a) £ 2 9(a)’ 

G{e{a)^)- F di<Aj l G(0,Z)d£<?^ 
J 1/2 c J 1/2 a 


as 9 (a) > 2 and so e(a) < where we have used (3.1) in the third term. 


□ 
Remark 4.4. It is immediate from Lemma 4.3 that, still writing
$\varepsilon(\alpha) = (\log\alpha)^2/\alpha$, we will have

(4.7)  $P_0[T^Z_{\varepsilon(\alpha)} < T^{\mathcal{D}_n}] = \mathcal{O}\Bigl(\frac{1}{(\log\alpha)^2}\Bigr).$

Proof of Proposition 4.2. Let $\varepsilon(\alpha)$ be as in Lemma 4.3. Looking
backward from the time $T^Z_{\varepsilon(\alpha)}$, assume we take the marked $n$-coalescent with
pair coalescence rate $2/Z$ as an approximation for the marked $n$-coalescent
with pair coalescence rate $2/(Z(1 - Z))$. Sources of error in constructing the
labeled partition are the recombination events that occur at a time when the
two coalescents have different numbers of extant lines. We will call such an
event a bad recombination event.

First we couple the two coalescent trees. We write $T^k_Z$, $T^k_{Z(1-Z)}$ for the
times at which the coalescents with rates $2/Z$ and $2/(Z(1 - Z))$ per pair,
respectively, have a transition from $k$ to $k - 1$ extant lineages. We shall call
our coupling successful if we have

$T^n_{Z(1-Z)} > T^n_Z > T^{n-1}_{Z(1-Z)} > T^{n-1}_Z > \cdots > T^k_{Z(1-Z)} > T^k_Z > \cdots > T^2_{Z(1-Z)} > T^2_Z.$


Let $S_1,\ldots,S_n$ be independent exponentially distributed random variables
with $S_k \sim \exp\bigl(\binom{k}{2}\bigr)$. The idea is simply to use the same random variable $S_k$
to generate the $k$th coalescence for both processes. Thus, writing $V$ for $Z$
or $Z(1 - Z)$, the times $T^k_V$ are defined recursively by



S n and 


f T v +1 2 

/ 77 ds = S, 

Jt$ V s 


k = 1 ,... ,n — 1 . 


Notice that our first inequality, T z ^_ z ^ > T z , is automatically satisfied. At 
time T z , l _ z ^ we have 


Thus 


f E(a) -^-ds> (1 -e(a))S n 

JT n 

Z(l-Z) s 


P[T"- 1 _^>T^] = P 


r rT z 
/ £(-) 

T— ds < 

rT z 
f e(cc) 

ds 

1 rpn—1 

1 Z{l-Z) 

Z s 

1 rpn —1 

1 z 

Zs 


< P[(l — s{a)){S n -i + S n ) < 5 n ] 

= 0(e{a)). 

Suppose then that T z 7 } ] _ z ^ < T z . Automatically then T z ~ l < an d 

at time there is most a further e(a)(S n -1 + S n ) to accumulate 
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n— 1 


in the integral defining T\ 

p [ T zu-zi> r r 1 ]=p 


as 

yZ 

/e(a) 


2 , /-J.l.) 2 

t. 


/mil — 2 

L J z(i-z) s 




< P[(l — £(a))(S n —2 + S^I-I + 'S'n) < S n —1 + S^] 


= C?(e(a)). 

Continuing in this way we see that the chance of failing to achieve a
successful coupling is $\mathcal{O}(\varepsilon(\alpha))$.

Now consider the chances of a bad recombination event on the tree when
we have a successful coupling. We have to consider the lengths of the intervals
when there are different numbers of lineages extant in the two coalescents,
but these are the times $T^k_{Z(1-Z)} - T^k_Z$. We know that these are time intervals
during which the $2/Z$ integral must accumulate on the order of $\varepsilon(\alpha)$. Since
$Z \le \varepsilon(\alpha)$, the time taken for this is $\mathcal{O}(\varepsilon(\alpha)^2)$. A bad recombination event
then has probability $\mathcal{O}\bigl(\frac{\gamma\alpha}{\log\alpha}\,\varepsilon(\alpha)^2\bigr)$.

The errors that we have so far are the following:

• Failure to couple: an error of $\mathcal{O}(\varepsilon(\alpha))$.

• Coupling but bad recombinations: an error of $\mathcal{O}\bigl(\frac{\gamma\alpha}{\log\alpha}\,\varepsilon(\alpha)^2\bigr)$.

Since, by Lemma 4.3, the additional error coming from a coalescence of
$\mathcal{C}_n$ or $\mathcal{D}_n$ between $T^Z_{\varepsilon(\alpha)}$ and $\tilde T$ is $\mathcal{O}((1/\log\alpha)^2)$, the proof of Proposition 4.2
is complete. □


Lemma 4.5. Let $Z = (Z_\tau)_{0\le\tau<\infty}$ be an $\alpha$-supercritical Feller process
governed by

$dZ = \sqrt{2Z}\,dW + \alpha Z\,d\tau,$

started in 0 and conditioned on nonextinction, and let $\mathcal{Y} = \mathcal{Y}^Z$ be the tree of
individuals with infinite lines of descent. Then the following statements are
true:

(a) When averaged over $Z$, $\mathcal{Y}^Z$ is a Yule tree with birth rate $\alpha$.

(b) The number of lines in $\mathcal{Y}^Z$ extant at time $\tilde T$ (the time when $Z$ first
hits the level 1) has a Poisson distribution with mean $\alpha$.

(c) Given $Z$, the pair coalescence rate of $\mathcal{Y}^Z$ viewed backward from time $\infty$
is $2/Z_\tau$.

Proof. Statement (a) is Theorem 3.2 of [19]. Statement (b) follows
from [7] and the strong Markov property. Statement (c) derives from the fact
that the pair coalescence rate of ancestral lineages in a continuum branching
process (be it supercritical or not), conditioned on the total mass path $Z$, is
a variant of Perkins' disintegration result; compare the discussion following
Theorem 1.1 in [3]. □
Remark 4.6. Part (c) of Lemma 4.5 also reveals that Kingman's coalescent
with pair coalescence rate $\sigma^2/P$ describes the genealogy behind (1.1).
Indeed, represent $P$ as

$P_t = \frac{Z^{(1)}_\tau}{Z^{(1)}_\tau + Z^{(2)}_\tau},$

where $Z^{(1)}$ is an $\alpha$-subcritical Feller process, $Z^{(2)}$ is a critical Feller process
and the time change from $t$ to $\tau$ is given by

$dt = \frac{d\tau}{Z^{(1)}_\tau + Z^{(2)}_\tau}.$

Lemma 4.5(c) says that, conditional on $Z^{(1)}$, the pair coalescence rate in the
genealogy of $Z^{(1)}$ is $\sigma^2\,d\tau/Z^{(1)}_\tau$, which equals

$\sigma^2\,dt\;\frac{Z^{(1)}_\tau + Z^{(2)}_\tau}{Z^{(1)}_\tau}.$


Proposition 4.7. The variation distance between the distributions of
the labeled partitions $\mathcal{P}^{\mathcal{Y}_n}$ (introduced in Definition 3.3) and $\mathcal{P}^{\mathcal{D}_n}$ is
$\mathcal{O}(1/(\log\alpha)^2)$.


Proof. Our proof proceeds in two steps. First we account for the error
that we make in assuming that there is no coalescence in the sample from
the leaves of the Yule tree after time $\tilde T$. Second we account for the error in
considering the process of marks on our Yule tree not up until time $\tilde T$ when
there are a Poisson($\alpha$) number of extant individuals, but until the time when
there are exactly $\lfloor\alpha\rfloor$ extant individuals.

(i) Given $Z$, take a sample of size $n$ from the leaves of $\mathcal{Y}^Z$. Write $\mathcal{Y}^Z_n$ for
the ancestral tree of this sample and call $\mathcal{Y}^Z_{n,\tilde T}$ the cutoff of $\mathcal{Y}^Z_n$ between times
$\tau = 0$ and $\tau = \tilde T$. Assume there are $N$ lines at time $\tilde T$. Since their lines of
ascent agglomerate like in a Polya urn, the proportions of their offspring in
the leaves of $\mathcal{Y}^Z$ are uniformly distributed on the simplex
$\{(p_1,\ldots,p_N) \mid p_h \ge 0,\ p_1 + \cdots + p_N = 1\}$. (See [10] and [13] for more background
on Polya urn schemes.) Writing $D_h$, $h = 1,\ldots,N$, for the number of all descendants of
individual number $h$ which belong to the sample, one therefore obtains that
$(D_1,\ldots,D_N)$ is uniformly distributed on

$B_{N,n} := \{(d_1,\ldots,d_N) \mid d_1,\ldots,d_N \in \mathbb{N}_0,\ d_1 + \cdots + d_N = n\},$

the set of occupation numbers of $N$ boxes with $n$ balls. (This distribution is
also called the Bose-Einstein distribution with parameters $N$ and $n$.) Under
this distribution the probability for multiple hits is

(4.8)  $1 - \binom{N}{n}\Big/\binom{N+n-1}{n} = \mathcal{O}\Bigl(\frac{1}{N}\Bigr) \quad\text{as } N \to \infty.$

Denote by $E_n$ the event that there is no coalescence of the ancestral lineages
of the sample between times $\infty$ and $\tilde T$. Because of Lemma 4.5(b), the
probability of the complement $E_n^c$ arises by averaging (4.8) over a Poisson($\alpha$)-distributed $N$.
Consequently the probability of $E_n^c$ when averaged over $Z$ is $\mathcal{O}(1/\alpha)$.
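
The multiple-hit probability under the Bose-Einstein distribution is explicit, since exactly $\binom{N}{n}$ of the $\binom{N+n-1}{n}$ occupation vectors have all entries in $\{0,1\}$. The following sketch evaluates it in exact arithmetic; the function name is ours. For fixed $n$ the result behaves like $n(n-1)/N$ for large $N$.

```python
from fractions import Fraction
from math import comb


def multiple_hit_probability(N, n):
    """P[some box holds >= 2 of the n balls], uniform law on occupation numbers of N boxes."""
    no_multiple_hits = Fraction(comb(N, n), comb(N + n - 1, n))
    return 1 - no_multiple_hits
```

For $n = 2$ this gives $2/(N+1)$, for example $2/11$ when $N = 10$.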

(ii) Next we estimate the difference which it makes for the labeled partitions
if we mark the branches of $\mathcal{Y}^Z$ at rate $\gamma\alpha/\log\alpha$ (1) between the real
times $\tau = 0$ and $\tau = \tilde T$ or (2) between the “Yule times” $i = 1$ and $i = \lfloor\alpha\rfloor$.
Since by Lemma 4.5(b) the number of lines extant in $\mathcal{Y}^Z$ at time $\tau = \tilde T$ is
Poisson($\alpha$), to complete the proof of the proposition it suffices to show that
the probability that $n$ chosen lines are hit by some mark between Yule times
$\lfloor\alpha\rfloor$ and $J$ is $\mathcal{O}(1/(\log\alpha)^2)$, where $J$ has a Poisson($\alpha$) distribution.

Now on the one hand, by the Chebyshev inequality,

(4.9)  $P\bigl[|J - \alpha| \ge \alpha^{3/4}\bigr] \le \frac{\alpha}{\alpha^{3/2}} = \alpha^{-1/2}.$

On the other hand, from (3.6) we see that the probability that $n$ lines are hit
by a mark between Yule times $i = \lfloor\alpha - \alpha^{3/4}\rfloor$ and $i = \lceil\alpha + \alpha^{3/4}\rceil$ is bounded
by


1 — exp — 


ny 

logct 



+ o 


i=la—a 3 / 4 J 


(logo) 


for a suitable C > 0. 

Combining this with (4.9) and step (i), the assertion of Proposition 4.7 
follows. □ 


Because of Corollary 3.5 and Propositions 4.2 and 4.7, and since the
distributions of $\mathcal{P}^{\mathcal{T}_n}$ and $\mathcal{P}^{\mathcal{C}_n}$ coincide, Proposition 3.6 is now immediate.
□


4.4. Within the Yule world: Proofs of Propositions 3.8-3.11 and 2.8. 

Proof of Proposition 3.8. All the results we are going to prove in
this section deal with a sample of size $n$ taken from the leaves of an infinite
Yule tree. As a key result we first obtain the “split times” in the sample
genealogy as time evolves forward from $t = 0$. Recall from Section 3 that
$I = I(t)$ is the number of lines of $\mathcal{Y}$ extant at time $t$ and that $K_i$ is the
number of lines extant in $\mathcal{Y}_n$ while $I = i$.


Lemma 4.8. $K_i$, $i = 1,2,\ldots$, starts in $K_1 = 1$ and is a time-inhomogeneous
Markov chain with transition probabilities

(4.10)  $P[K_{i+1} = k+1 \mid K_i = k] = \frac{n-k}{n+i}, \qquad P[K_{i+1} = k \mid K_i = k] = \frac{k+i}{n+i}.$

Its backward transition probabilities are

(4.11)  $P[K_i = k \mid K_{i+1} = k+1] = \frac{k(k+1)}{i(i+1)}.$

The one-time probabilities and the more-step forward and backward transition
probabilities of $K$ are given by

(4.12)  $P[K_i = k] = \frac{\binom{n-1}{n-k}\binom{i}{k}}{\binom{n+i-1}{n}} \qquad (1 \le k \le \min(i,n)),$

(4.13)  $P[K_j = l \mid K_i = k] = \frac{\binom{n-k}{n-l}\binom{j+k-1}{i+l-1}}{\binom{n+j-1}{n+i-1}},$

(4.14)  $P[K_i = k \mid K_j = l] = \frac{\binom{l-1}{k-1}\binom{i}{k}\binom{j+k-1}{i+l-1}}{\binom{j-1}{i-1}\binom{j}{l}} \qquad (1 \le i, l \le j;\ k \le \min(i,l)).$
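
The statements of Lemma 4.8 lend themselves to a mechanical check. The sketch below iterates the one-step rule (4.10), read as $P[K_{i+1} = k+1 \mid K_i = k] = (n-k)/(n+i)$, and compares the result with the closed form (4.12), read as $\binom{n-1}{k-1}\binom{i}{k}/\binom{n+i-1}{n}$. Function names are ours; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from math import comb


def forward_distribution(n, i_max):
    """Law of K_i for i = 1..i_max, obtained by iterating the one-step rule (4.10)."""
    dist = {1: {1: Fraction(1)}}
    for i in range(1, i_max):
        nxt = {}
        for k, p in dist[i].items():
            up = Fraction(n - k, n + i)          # a new fertile line appears
            nxt[k] = nxt.get(k, Fraction(0)) + p * (1 - up)
            if k < n:
                nxt[k + 1] = nxt.get(k + 1, Fraction(0)) + p * up
        dist[i + 1] = nxt
    return dist


def one_time_probability(n, i, k):
    """Closed form (4.12) for P[K_i = k]."""
    return Fraction(comb(n - 1, k - 1) * comb(i, k), comb(n + i - 1, n))
```

Combining the two with Bayes' formula also reproduces the backward probabilities $k(k+1)/(i(i+1))$ of (4.11).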


Proof. (i) We begin by deriving the one-step transition probabilities
(4.10). At each Yule time $i$, attach to each line of $\mathcal{Y}$ the label $1 + d$, where
$d$ is the number of the line's descendants at infinity which belong to the
sample. Call a line of $\mathcal{Y}$ fertile if it belongs to $\mathcal{Y}_n$, that is, if its attached
label is larger than 1.

Passing from $i = 1$ to $i = 2$, this induces a split of the sample into subgroups
of sizes $D_1$ and $D_2 = n - D_1$, where $D_1$ is uniform on $\{0,1,\ldots,n\}$
(see the argument in the proof of Proposition 4.7). Given $(D_1, D_2)$, the proportion
of the population at infinity in the tree $\mathcal{Y}$ that is descended from
the line labeled $1 + D_1$ is Beta$(1 + D_1, 1 + n - D_1)$-distributed, the posterior
of a Beta$(1,1)$ with $D_1$ successes in $n$ trials. Hence the birth in $\mathcal{Y}$ at Yule
time $i = 2$ is to the line labeled $1 + D_1$ with probability $\frac{1 + D_1}{2 + n}$. If this is the
case, $D_1$ is split uniformly into two subgroups, where a uniform split of 0 is
understood as (0,0).

At the $i$th stage of our construction there will be $i$ lines and an associated
partition of $n + i$ into $i$ subsets with sizes denoted by

$1 + D^i_1,\ 1 + D^i_2,\ \ldots,\ 1 + D^i_i,$

where $D^i_1,\ldots,D^i_i$ are nonnegative integers that sum to $n$. The $(i+1)$st split
is of the $j$th subset with probability $\frac{1 + D^i_j}{i + n}$. At Yule time $i$, the probability of
a “true split,” that is, a split leading to two fertile successors of the fertile
line labeled $1 + D^i_j$, is

$\frac{1 + D^i_j}{i + n}\cdot\frac{D^i_j - 1}{D^i_j + 1} = \frac{D^i_j - 1}{i + n}.$

Hence, given that the number $K_i$ of fertile lines at Yule time $i$ equals $k$, the
probability of an increase of the number of fertile lines by 1 is

$\frac{n - k}{i + n}.$

Thus, the process $(K_i)$ is a pure birth process in discrete time, starting
in $K_1 = 1$, with time inhomogeneous transition probability given by (4.10).

Formula (4.11) follows from (4.10) and (4.12), which will be derived in
the next step.

(ii) Next we derive the one-time probabilities (4.12). For this, take a
sample of size $n$ from the leaves of the Yule tree $\mathcal{Y}$ and look at Yule time $i$,
which is the period when there are $i$ individuals extant in $\mathcal{Y}$. Number these
individuals by $h = 1,\ldots,i$ and let $D_h$, $h = 1,\ldots,i$, be the number of all
descendants of individual number $h$ which belong to the sample. Then, by
the argument given in the proof of Proposition 4.7, $(D_1,\ldots,D_i)$ is uniformly
distributed on

$B_{i,n} := \{(d_1,\ldots,d_i) \mid d_1,\ldots,d_i \in \mathbb{N}_0,\ d_1 + \cdots + d_i = n\},$

the set of occupation numbers of $i$ boxes with $n$ balls. This distribution is
also called the Bose-Einstein distribution with parameters $i$ and $n$.

The event “there are $k$ fertile lines at time $i$” thus has the same distribution
as the event “$k$ of the $i$ boxes are occupied and the remaining $i - k$ are
empty” under the Bose-Einstein distribution with parameters $i$ and $n$.

Let

$B^+_{k,n} := \{(d_1,\ldots,d_k) \mid d_1,\ldots,d_k \in \mathbb{N},\ d_1 + \cdots + d_k = n\}.$

It is easy to see that $\#B_{i,n} = \binom{n+i-1}{n}$ and $\#B^+_{k,n} = \#B_{k,n-k} = \binom{n-1}{n-k}$.
Hence, (4.12) arises as the probability under the uniform distribution on
$B_{i,n}$ that $k$ of the $i$ boxes are occupied and the remaining $i - k$ are empty.

(iii) From (i) we see that our Markov chain can be represented in the
following way: At Yule time $i$ there are $i$ real individuals in the Yule tree.
Additionally, there are $n$ virtual individuals, corresponding to the sample of
size $n$, each of which is attached to one of the real individuals. In this way,
the $n$ virtual individuals are split in $k$ blocks; imagine that each block has
one of its virtual individuals as its own block leader. Then evolve the Markov
chain by choosing at each time one of the $n + i$ (real or virtual) individuals at
random. When a real individual is chosen, it splits into two real individuals
(leaving the number of fertile lines constant). When a virtual individual is
chosen, then it is either a block leader or not. If it is a block leader, we find
a split of a fertile line into one fertile and one infertile line (again leaving the
number of blocks constant). When a non-block leader is picked, this gives
rise to a new block and the chosen individual becomes a block leader. This
then increases the number of blocks by 1.

We now proceed to prove (4.13). We start the Yule tree when it has
$i$ lines, assuming there are currently $k$ blocks of virtual individuals. Thus
there are currently $k$ block leaders and $n - k$ virtual individuals that are
eligible to become new block leaders by time $j$. To distribute the additional
$j - i$ individuals that enter the Yule tree between Yule times $i$ and $j$, note
that these can choose among $i + n$ individuals as potential ancestors at time
$i$. Thus the number of ways to distribute the $j - i$ additional individuals on
$i + n$ ancestors is

$\binom{n+j-1}{j-i} = \binom{n+j-1}{n+i-1},$

and all of them have the same probability since the additional individuals
arrive as in a Polya scheme. To obtain $l$ blocks at Yule time $j$, having $k$
block leaders at Yule time $i$, we must calculate the probability of having
added $l - k$ blocks up to Yule time $j$. At Yule time $i$, there are already $k$
block leaders and the number of ways to choose $l - k$ additional ones from
the remaining $n - k$ potential new block leaders is

$\binom{n-k}{l-k}.$

To realize these new block leaders, we already have to use $l - k$ individuals.
The remaining $j - i - (l - k)$ individuals must be distributed among the
$i + n$ individuals present at Yule time $i$. However, to obtain $l$ blocks at Yule
time $j$, these individuals must avoid the $n - l$ nonblock leaders (because this
would result in new blocks). The number of ways to distribute $j - i - (l - k)$
balls on $i + n - (n - l)$ boxes is

$\binom{j+k-1}{j-i-(l-k)} = \binom{j+k-1}{i+l-1}.$

Altogether we arrive at (4.13).

(iv) The more-step backward transition probabilities (4.14) follow from
(4.12), (4.13) and Bayes' formula. □


Remark 4.9. (i) Here is a self-contained derivation of the more-step
backward transition probabilities (4.14). For $i, l \le j$, $k \le \min(i,l)$, consider
the classical Polya model with $i$ ancestors and $j - i$ newcomers, leading
to $j$ people after the successive arrivals of the newcomers. Each of these
$j - i$ newcomers joins the family of one of the $i$ ancestors by randomly
choosing one of the extant individuals. The joint distribution of the numbers
of newcomers in each family is then Bose-Einstein with parameters $i$ and
$j - i$. There are now $j$ people in our population. Sample $l$ at random (without
replacement). Then the conditional probability $P[K_i = k \mid K_j = l]$ equals the
probability that these $l$ people belong to exactly $k$ families. We can now
decompose with respect to the number $U$ of individuals in the $l$-sample
which are among the original $i$ ancestors. This number is hypergeometric,
choosing $l$ out of $j = i + (j - i)$ [see (2.6)].

Given $U = u$, the probability that the sample forms exactly $k$ distinct families
is the probability that exactly $k - u$ of the available $i - u$ ancestors have
descendants among the $l - u$ newcomers in the sample. There are $\binom{i-u}{k-u}$ possible
choices of the additional ancestors and, conditional on that choice, there
are $\binom{l-u-(k-u)+k-1}{l-u-(k-u)}$ ways to distribute the remaining newcomers. Hence the
conditional probability that the sample forms exactly $k$ distinct families is

$\frac{\binom{i-u}{k-u}\binom{l-u-(k-u)+k-1}{l-u-(k-u)}}{\binom{l-u+i-1}{l-u}} = \frac{\binom{i-u}{k-u}\binom{l-1}{l-k}}{\binom{l-u+i-1}{i-1}}.$

We thus conclude that

(4.15)  $P[K_i = k \mid K_j = l] = \sum_{u=0}^{k} \frac{\binom{i}{u}\binom{j-i}{l-u}}{\binom{j}{l}}\cdot\frac{\binom{i-u}{k-u}\binom{l-1}{l-k}}{\binom{l-u+i-1}{i-1}}.$

The fact that the right-hand sides of (4.14) and (4.15) are equal can be
checked by an elementary but tedious calculation (or by your favorite computer
algebra package).

(ii) Note that (4.12) follows from (4.14) by putting $l = n$ and letting $j$ tend
to infinity, (4.13) follows from (4.14) and (4.12) by the Bayes formula, and
(4.10) and (4.11) specialize from (4.13) and (4.14).
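
In the spirit of the remark about computer algebra, the agreement of the two expressions for $P[K_i = k \mid K_j = l]$ can be confirmed by brute force for small parameters. The sketch below is not a proof. It assumes that the closed form obtained from (4.12), (4.13) and Bayes' formula is $\binom{l-1}{k-1}\binom{i}{k}\binom{j+k-1}{i+l-1}/(\binom{j-1}{i-1}\binom{j}{l})$ and that the decomposition over $U$ is the hypergeometric weight $\binom{i}{u}\binom{j-i}{l-u}/\binom{j}{l}$ times the conditional family probability given in part (i); both readings and the function names are ours.

```python
from fractions import Fraction
from math import comb


def backward_closed_form(i, k, j, l):
    """P[K_i = k | K_j = l] via Bayes from (4.12) and (4.13); the sample size n cancels."""
    num = comb(l - 1, k - 1) * comb(i, k) * comb(j + k - 1, i + l - 1)
    return Fraction(num, comb(j - 1, i - 1) * comb(j, l))


def backward_polya_sum(i, k, j, l):
    """The same probability by conditioning on U, the number of sampled original ancestors."""
    total = Fraction(0)
    for u in range(0, k + 1):
        p_u = Fraction(comb(i, u) * comb(j - i, l - u), comb(j, l))
        families = Fraction(comb(i - u, k - u) * comb(l - 1, l - k),
                            comb(l - u + i - 1, i - 1))
        total += p_u * families
    return total
```

Looping over all admissible $(i, k, j, l)$ with $j$ up to a small bound and comparing the two functions gives exact equality in rational arithmetic.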

With Lemma 4.8 it is now easy to complete the proof of Proposition 3.8:

$P[F \le i] = P[K_i = n] = \frac{\binom{i}{n}}{\binom{n+i-1}{n}} = \frac{i!\,(i-1)!}{(i-n)!\,(n+i-1)!} = \frac{(i-1)\cdots(i-n+1)}{(i+n-1)\cdots(i+1)}.$

Proof of Proposition 3.9. As a preparation, we prove the following 
lemma: 


Lemma 4.10. For the first Yule time $F$ when there are $n$ fertile lines
we have, for $k < n$,

$$P[F = f \mid K_i = k] = \frac{(f-i-1)\cdots(f-i-(n-k)+1)}{(f+n-1)\cdots(f+k-1)}\,(n-k)(n+i-1), \qquad k < n-1,$$
$$P[F = f \mid K_i = n-1] = \frac{n+i-1}{(f+n-1)(f+n-2)}, \tag{4.16}$$

and

$$P[F = f \mid K_i = k] \le \frac{Ci}{f^2}, \tag{4.17}$$

where $C$ depends only on $n$.


Proof. Equation (4.16) follows by

$$\begin{aligned}
P[F = f \mid K_i = k]
&= P[F > f-1 \mid K_i = k] - P[F > f \mid K_i = k] \\
&= P[K_f = n \mid K_i = k] - P[K_{f-1} = n \mid K_i = k] \\
&= \frac{\binom{f+k-1}{i+n-1}}{\binom{n+f-1}{n+i-1}} - \frac{\binom{f+k-2}{i+n-1}}{\binom{n+f-2}{n+i-1}} \\
&= \frac{(f+k-1)!\,(f-i)!}{(f-i+k-n)!\,(n+f-1)!} - \frac{(f+k-2)!\,(f-i-1)!}{(f-i+k-n-1)!\,(n+f-2)!} \\
&= \frac{(f+k-2)!\,(f-i-1)!\,\bigl((f+k-1)(f-i) - (f-i+k-n)(n+f-1)\bigr)}{(f-i+k-n)!\,(n+f-1)!} \\
&= \frac{(f+k-2)!\,(f-i-1)!\,(n-k)(n+i-1)}{(f-i+k-n)!\,(n+f-1)!},
\end{aligned}$$

where we have used

$$\begin{aligned}
(f+k-1)(f-i) - (f-i+k-n)(n+f-1)
&= (f-i)(k-n) - (k-n)(n+f-1) \\
&= (k-n)(f-i-n-f+1) = (n-k)(n+i-1).
\end{aligned}$$

If $k < n-1$, the terms $(f-i-1)!$ and $(f-i+k-n)!$ cancel partially,
leading to

$$P[F = f \mid K_i = k] = \frac{(f-i-1)\cdots(f-i-(n-k)+1)}{(f+n-1)\cdots(f+k-1)}\,(n-k)(n+i-1).$$

In the case $k = n-1$, these two terms cancel completely, which proves (4.16).

To see (4.17), note that the fraction in (4.16) is bounded by $1/f^2$ and
$(n-k)(n+i-1) \le n^2 i$. □
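As a sanity check on Lemma 4.10, the difference of binomial ratios from the proof can be compared with the closed form in exact arithmetic. This is only an illustrative sketch: the helper names are hypothetical, and the constant $n^2$ in the bound is the one read off from the last line of the proof.

```python
from fractions import Fraction
from math import comb, factorial


def p_all_fertile(f, i, k, n):
    """P[K_f = n | K_i = k] in the form used in the proof of (4.16)."""
    return Fraction(comb(f + k - 1, i + n - 1), comb(n + f - 1, n + i - 1))


def p_first_time(f, i, k, n):
    """P[F = f | K_i = k] as a difference of the two ratios above."""
    return p_all_fertile(f, i, k, n) - p_all_fertile(f - 1, i, k, n)


def closed_form(f, i, k, n):
    """Factorial expression obtained at the end of the displayed computation.

    Valid for f >= i + n - k, so that every factorial argument is nonnegative.
    """
    num = factorial(f + k - 2) * factorial(f - i - 1) * (n - k) * (n + i - 1)
    den = factorial(f - i + k - n) * factorial(n + f - 1)
    return Fraction(num, den)


def bound_4_17(f, i, n):
    """Right-hand side of (4.17) with the explicit constant C = n**2."""
    return Fraction(n * n * i, f * f)
```

For example, `p_first_time(8, 2, 1, 5)` and `closed_form(8, 2, 1, 5)` return the same rational number.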


Recall the definitions of $M_i$, $M = \sum_{i=1}^{\lfloor\alpha\rfloor} M_i$ and $S^Y$ from Section 3. We
first turn to the proof of (3.5).
First observe that since $\{M \ge 2\}$ requires either that the tree is hit by
marks during two distinct Yule times in the early phase or at least twice
during a single Yule time (during the early phase), we may estimate

$$P[M \ge 2] \le \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{j=i+1}^{\lfloor\alpha\rfloor} P[M_i \ge 1, M_j \ge 1, K_j < n] + \sum_{i=1}^{\lfloor\alpha\rfloor} P[M_i \ge 2].$$

For the first term we use (4.17) to see that there is some constant $C$ that
depends only on $n$ such that, for $k < n$,

$$P[F > j \mid K_i = k] \le \frac{Ci}{j}.$$

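With this we approximate (recall Remark 3.7) for constants $C$ which depend
only on $n$ and can change from occurrence to occurrence:

$$\begin{aligned}
&\sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{j=i+1}^{\lfloor\alpha\rfloor} P[M_i \ge 1, M_j \ge 1, K_j < n] \\
&\quad\le \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{j=i+1}^{\lfloor\alpha\rfloor} \sum_{k_1,k_2=1}^{n-1} P[M_j \ge 1 \mid K_j = k_2] \cdot P[M_i \ge 1 \mid K_i = k_1] \, P[K_i = k_1, K_j = k_2] \\
&\quad\le \frac{C}{(\log\alpha)^2} \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{j=i+1}^{\lfloor\alpha\rfloor} \sum_{k_1=1}^{n-1} \frac{1}{ij}\, P[K_j \le n-1 \mid K_i = k_1] \cdot P[K_i = k_1] \\
&\quad= \frac{C}{(\log\alpha)^2} \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{j=i+1}^{\lfloor\alpha\rfloor} \sum_{k_1=1}^{n-1} \frac{1}{ij}\, P[F > j \mid K_i = k_1] \cdot P[K_i = k_1] \\
&\quad\le \frac{C}{(\log\alpha)^2} \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{j=i+1}^{\lfloor\alpha\rfloor} \frac{1}{j^2}\, P[F > i] \\
&\quad\le \frac{C}{(\log\alpha)^2} \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{j=i+1}^{\lfloor\alpha\rfloor} \frac{1}{i j^2}
\le \frac{C}{(\log\alpha)^2} \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{1}{i^2} \le \frac{C}{(\log\alpha)^2}.
\end{aligned}$$

For the second term, again by Remark 3.7,

$$\sum_{i=1}^{\lfloor\alpha\rfloor} P[M_i \ge 2] \le \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{C}{i^2 (\log\alpha)^2} \le \frac{C}{(\log\alpha)^2},$$

which proves (3.5).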

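Consequently, we can now concentrate on the event $\{M = 1\}$. The proof
of (3.4) consists of two steps. First we will show that

$$P[S^Y = s, M = 1] =
\begin{cases}
O\!\left(\dfrac{1}{(\log\alpha)^2}\right) + \dfrac{\gamma}{\log\alpha} \displaystyle\sum_{i=1}^{\lfloor\alpha\rfloor} \dfrac{\binom{n-s+i-2}{n-s}}{\binom{n+i-1}{n}}, & s \ge 2, \\[3ex]
O\!\left(\dfrac{1}{(\log\alpha)^2}\right) + \dfrac{\gamma}{\log\alpha} \displaystyle\sum_{i=1}^{\lfloor\alpha\rfloor} \dfrac{\binom{n+i-3}{n-1} - \binom{i-1}{n-1}}{\binom{n+i-1}{n}}, & s = 1.
\end{cases} \tag{4.18}$$

Second we will approximate these probabilities to obtain (3.4).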

Given that the sample genealogy is hit by exactly one mark, and given
that this happens when there are $k$ fertile lines ($1 \le k \le n-1$), then the
(conditional) probability that $S^Y = s$ ($1 \le s \le n-k+1$) is (in the notation
of the proof of Lemma 4.8)

$$P[S^Y = s \mid K_i = k, M = M_i = 1] = \frac{\binom{n-k-(s-1)+(k-1)-1}{n-k-(s-1)}}{\binom{n-k+k-1}{n-k}} = \frac{\binom{n-s-1}{n-s-(k-1)}}{\binom{n-1}{n-k}}. \tag{4.19}$$


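[Indeed, the one line which is hit by a mark during period $i$ must spawn $s-1$
offspring and the other $k-1$ lines must spawn $n-k-(s-1)$ offspring.]
The probability that the sample genealogy is hit by a mark during period
$i$ and there are $k$ fertile lines [$k \le i \wedge (n-1)$] during period $i$ is, according
to Remark 3.7 and (3.6),

$$\begin{aligned}
P[M_i = 1, K_i = k]
&= \left(\frac{\gamma k}{i \log\alpha} + O\!\left(\frac{1}{i^2 (\log\alpha)^2}\right)\right) P[K_i = k] \\
&= \frac{\gamma}{\log\alpha}\, \frac{k \binom{i}{k}\binom{n-1}{n-k}}{i \binom{n+i-1}{n}} + O\!\left(\frac{1}{i^2 (\log\alpha)^2}\right) \\
&= \frac{\gamma}{\log\alpha}\, \frac{\binom{i-1}{k-1}\binom{n-1}{n-k}}{\binom{n+i-1}{n}} + O\!\left(\frac{1}{i^2 (\log\alpha)^2}\right).
\end{aligned} \tag{4.20}$$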
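The Bose-Einstein count behind (4.19) can be confirmed by brute force for small samples. The sketch below is illustrative only (all function names are hypothetical): it enumerates every way of distributing the $n-k$ newcomers among the $k$ fertile lines, treats each occupancy vector as equally likely, and compares the frequency with which a fixed line receives $s-1$ newcomers with the ratio of occupancy counts.

```python
from fractions import Fraction
from itertools import product
from math import comb


def occupancy_count(balls, boxes):
    """Number of ways to place indistinguishable balls into labelled boxes."""
    if boxes == 0:
        return 1 if balls == 0 else 0
    return comb(balls + boxes - 1, balls)


def family_size_prob(n, k, s):
    """Ratio of occupancy counts: the marked line gets s - 1 of the n - k newcomers."""
    return Fraction(occupancy_count(n - k - (s - 1), k - 1),
                    occupancy_count(n - k, k))


def family_size_prob_bruteforce(n, k, s):
    """Same probability by enumerating all occupancy vectors (uniform weights)."""
    newcomers = n - k
    configs = [c for c in product(range(newcomers + 1), repeat=k)
               if sum(c) == newcomers]
    hits = sum(1 for c in configs if c[0] == s - 1)
    return Fraction(hits, len(configs))
```

The helper `occupancy_count` handles the boundary case of zero boxes explicitly, which corresponds to $k = 1$ and $s = n$.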


To now show (4.18), we need some approximations.

Lemma 4.11. For constants $C_1$ and $C_2$ which only depend on $n$,

$$\sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} (\log i)\, P[K_i = k, M_i = 1] \le \frac{C_1}{\log\alpha}, \tag{4.21}$$

$$P[M \ge 2 \mid M_i = 1, K_i = k] \le \frac{C_2 (1 + \log i)}{\log\alpha}. \tag{4.22}$$

Proof. We will use the integral

$$\int \frac{\log x}{x^2}\, dx = -\frac{1 + \log x}{x}, \tag{4.23}$$

which will give us finiteness of some constants.
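For (4.21), we have, from (4.20),

$$\begin{aligned}
\sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} (\log i)\, P[K_i = k, M_i = 1]
&= O\!\left(\frac{1}{(\log\alpha)^2}\right) + \frac{\gamma}{\log\alpha} \sum_{i=1}^{\lfloor\alpha\rfloor} \log i \sum_{k=1}^{i \wedge (n-1)} \frac{\binom{i-1}{k-1}\binom{n-1}{n-k}}{\binom{n+i-1}{n}} \\
&= O\!\left(\frac{1}{(\log\alpha)^2}\right) + \frac{\gamma}{\log\alpha} \sum_{i=1}^{\lfloor\alpha\rfloor} \log i \cdot \frac{\binom{n+i-2}{n-1} - \binom{i-1}{n-1}}{\binom{n+i-1}{n}}.
\end{aligned}$$

Now, for some constants $C$ and $C'$ which are bounded in $i$ (and may change
from appearance to appearance),

$$\binom{n+i-2}{n-1} - \binom{i-1}{n-1} = \frac{[(i+n-2)\cdots i] - [(i-1)\cdots(i-(n-1))]}{(n-1)!} \le \frac{i^{n-1} + C i^{n-2} - i^{n-1} + C' i^{n-2}}{(n-1)!} \le C i^{n-2}$$

and, because for $i \ge 2$

$$\binom{n+i-1}{n} \ge C i^n,$$

we see that

$$\sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} (\log i)\, P[K_i = k, M = M_i = 1] \le O\!\left(\frac{1}{(\log\alpha)^2}\right) + \frac{\gamma}{\log\alpha} \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{C i^{n-2} \log i}{C i^n} \le \frac{C''}{\log\alpha},$$

where we have used (4.23).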

For (4.22), using (4.17) and (4.23), we write, for some constant $C$ depending
only on $n$ which can change from instance to instance,

$$P[M \ge 2 \mid M_i = 1, K_i = k] = \sum_{j=i}^{\lfloor\alpha\rfloor} P[M \ge 2 \mid M_i = 1, K_i = k, F = j] \cdot P[F = j \mid K_i = k] \tag{4.24}$$

$$\le \frac{C}{\log\alpha} \sum_{j=i}^{\lfloor\alpha\rfloor} \frac{i \log j}{j^2} \le \frac{C(1 + \log i)}{\log\alpha}. \qquad \Box \tag{4.25}$$


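We now return to the proof of (4.18). The probability that the sample
genealogy is hit by a unique early mark and that $S^Y = s \le n$ is

$$\begin{aligned}
P[S^Y = s, M = 1]
&= \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} P[S^Y = s, M = M_i = 1, K_i = k] \\
&= \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} P[S^Y = s \mid M = M_i = 1, K_i = k] \\
&\qquad\qquad \times P[M_i = 1, K_i = k]\,\bigl(1 - P[M \ge 2 \mid M_i = 1, K_i = k]\bigr).
\end{aligned} \tag{4.26}$$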

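For this sum, the event $\{M \ge 2\}$ does not play a role, because by (4.21) and
(4.22),

$$\sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} P[M_i = 1, K_i = k] \cdot P[M \ge 2 \mid M_i = 1, K_i = k] \le \frac{C}{\log\alpha} \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} (1 + \log i)\, P[M_i = 1, K_i = k] \le \frac{C}{(\log\alpha)^2}. \tag{4.27}$$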

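So, combining (4.19) and (4.20), we have

$$\begin{aligned}
P[S^Y = s, M = 1]
&= O\!\left(\frac{1}{(\log\alpha)^2}\right) + \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} P[S^Y = s \mid M = M_i = 1, K_i = k] \, P[M_i = 1, K_i = k] \\
&= O\!\left(\frac{1}{(\log\alpha)^2}\right) + \frac{\gamma}{\log\alpha} \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} \frac{\binom{n-s-1}{n-s-(k-1)}\binom{i-1}{k-1}}{\binom{n+i-1}{n}} \\
&= O\!\left(\frac{1}{(\log\alpha)^2}\right) + \frac{\gamma}{\log\alpha} \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{\binom{n-s+i-2}{n-s} - \binom{i-1}{n-1}\mathbf{1}\{s=1\}}{\binom{n+i-1}{n}} \\
&= \begin{cases}
O\!\left(\dfrac{1}{(\log\alpha)^2}\right) + \dfrac{\gamma}{\log\alpha} \displaystyle\sum_{i=1}^{\lfloor\alpha\rfloor} \dfrac{\binom{n-s+i-2}{n-s}}{\binom{n+i-1}{n}}, & s \ge 2, \\[3ex]
O\!\left(\dfrac{1}{(\log\alpha)^2}\right) + \dfrac{\gamma}{\log\alpha} \displaystyle\sum_{i=1}^{\lfloor\alpha\rfloor} \dfrac{\binom{n+i-3}{n-1} - \binom{i-1}{n-1}}{\binom{n+i-1}{n}}, & s = 1,
\end{cases}
\end{aligned} \tag{4.28}$$

which proves (4.18).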
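The step from the double sum to the single sum in this computation is a Vandermonde-type identity in which, for $s = 1$, the term $k = n$ is missing from the range of summation. A small sketch (hypothetical function names, restricted to $1 \le s \le n-1$ so that all binomial arguments are nonnegative) makes this concrete:

```python
from math import comb


def inner_sum(n, s, i):
    """Sum over k = 1..min(i, n-1) of C(n-s-1, n-s-(k-1)) * C(i-1, k-1)."""
    total = 0
    for k in range(1, min(i, n - 1) + 1):
        lower = n - s - (k - 1)
        if lower < 0:
            continue  # binomial coefficient with negative lower index is zero
        total += comb(n - s - 1, lower) * comb(i - 1, k - 1)
    return total


def collapsed(n, s, i):
    """Single binomial, corrected by the missing k = n term when s = 1."""
    correction = comb(i - 1, n - 1) if s == 1 else 0
    return comb(n - s + i - 2, n - s) - correction
```

The case $s = n$ relies on the convention $\binom{-1}{0} = 1$ and is deliberately left out of this sketch.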

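We now approximate these probabilities further to obtain (3.4). First
observe that, since

$$\frac{1}{(n+i-1)\cdots i} = \frac{1}{n-1}\left(\frac{1}{(n+i-2)\cdots i} - \frac{1}{(n+i-1)\cdots(i+1)}\right)$$

for $s = n$ and $n \ge 2$, we may write

$$\sum_{i=1}^{\lfloor\alpha\rfloor} \frac{1}{\binom{n+i-1}{n}} = n! \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{1}{(n+i-1)\cdots i} = \frac{n!}{(n-1)\cdot(n-1)!} + O\!\left(\frac{1}{\alpha}\right) = \frac{n}{n-1} + O\!\left(\frac{1}{\alpha}\right), \tag{4.29}$$

which gives (3.4) in the case $s = n$.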
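The telescoping argument can be made fully explicit: summing the identity above from $i = 1$ to $\lfloor\alpha\rfloor$ gives the exact value $\frac{n}{n-1}\bigl(1 - 1/\binom{n+\lfloor\alpha\rfloor-1}{n-1}\bigr)$, which is a worked consequence of the identity rather than a formula stated in the paper. A short sketch with illustrative names, where the integer argument `alpha` stands for $\lfloor\alpha\rfloor$:

```python
from fractions import Fraction
from math import comb


def sum_inverse_binom(n, alpha):
    """Left-hand side: sum over i = 1..alpha of 1 / C(n+i-1, n)."""
    return sum(Fraction(1, comb(n + i - 1, n)) for i in range(1, alpha + 1))


def telescoped(n, alpha):
    """Exact value after telescoping; the second term is the O(1/alpha) error."""
    return Fraction(n, n - 1) * (1 - Fraction(1, comb(n + alpha - 1, n - 1)))
```

For $n \ge 2$ the correction term is at most of order $1/\alpha$, in line with the error term in the text.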
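Now define

$$A(n,s,\alpha) := \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{\binom{n-s+i-2}{n-s}}{\binom{n+i-1}{n}}. \tag{4.30}$$

For $s \le n-1$, the summand vanishes for $i = 1$ and so

$$\begin{aligned}
A(n,s,\alpha)
&= \sum_{i=1}^{\lfloor\alpha\rfloor-1} \frac{\binom{n-s+i-1}{n-s}}{\binom{n+i}{n}}
= \sum_{i=1}^{\lfloor\alpha\rfloor-1} \frac{n!\, i!\, (n-s+i-1)!}{(n+i)!\,(n-s)!\,(i-1)!} \\
&= \frac{n!}{(n-s)!} \sum_{i=1}^{\lfloor\alpha\rfloor-1} \frac{i}{(i+n)\cdots(i+n-s)} \\
&= \frac{n!}{(n-s)!} \sum_{i=1}^{\lfloor\alpha\rfloor-1} \left(\frac{1}{(i+n)\cdots(i+n-s+1)} - \frac{n-s}{(i+n)\cdots(i+n-s)}\right).
\end{aligned} \tag{4.31}$$

We treat the two sums separately and rewrite each as a telescoping sum as
in the derivation of (4.29) to see that, for $2 \le s \le n-1$, this gives

$$\begin{aligned}
A(n,s,\alpha)
&= \frac{n!}{(n-s)!} \left(\frac{1}{s-1}\cdot\frac{1}{n\cdots(n-s+2)} - \frac{n-s}{s}\cdot\frac{1}{n\cdots(n-s+1)}\right) + O\!\left(\frac{1}{\alpha}\right) \\
&= \frac{n-s+1}{s-1} - \frac{n-s}{s} + O\!\left(\frac{1}{\alpha}\right) \\
&= \frac{(n-s+1)s - (s-1)(n-s)}{(s-1)s} + O\!\left(\frac{1}{\alpha}\right)
= \frac{n}{(s-1)s} + O\!\left(\frac{1}{\alpha}\right),
\end{aligned}$$

so (3.4) is also proved for $2 \le s \le n-1$. For $s = 1$, the above gives

$$\begin{aligned}
A(n,1,\alpha)
&= n \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{1}{i+n} - n(n-1) \sum_{i=1}^{\lfloor\alpha\rfloor} \left(\frac{1}{i+n-1} - \frac{1}{i+n}\right) + O\!\left(\frac{1}{\alpha}\right) \\
&= n \sum_{i=n+1}^{\lfloor\alpha\rfloor} \frac{1}{i} - n(n-1)\frac{1}{n} + O\!\left(\frac{1}{\alpha}\right) \\
&= 1 - n + n \sum_{i=n+1}^{\lfloor\alpha\rfloor} \frac{1}{i} + O\!\left(\frac{1}{\alpha}\right).
\end{aligned} \tag{4.32}$$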
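Both approximations of $A(n,s,\alpha)$ are easy to test numerically. The following sketch uses hypothetical helper names and takes an integer `alpha` in place of $\lfloor\alpha\rfloor$; it compares the defining sum with $n/((s-1)s)$ for $2 \le s \le n-1$ and with $1 - n + n\sum_{i=n+1}^{\lfloor\alpha\rfloor} 1/i$ for $s = 1$.

```python
from math import comb


def A(n, s, alpha):
    """Defining sum of A(n, s, alpha) for 1 <= s <= n - 1 (floating point)."""
    return sum(comb(n - s + i - 2, n - s) / comb(n + i - 1, n)
               for i in range(1, alpha + 1))


def A_limit_middle(n, s):
    """Leading term for 2 <= s <= n - 1."""
    return n / ((s - 1) * s)


def A_approx_s1(n, alpha):
    """Leading term for s = 1, which still grows like n * log(alpha)."""
    return 1 - n + n * sum(1.0 / i for i in range(n + 1, alpha + 1))
```

With `alpha = 20000` and `n = 5` the discrepancies are of the order of a few thousandths, consistent with an $O(1/\alpha)$ error whose constant depends on $n$.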


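For $s = 1$, we also have to deal with the second term in (4.18). We write

$$\sum_{i=1}^{\lfloor\alpha\rfloor} \frac{\binom{i-1}{n-1}}{\binom{n+i-1}{n}} = \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{(i-1)!\,(i-1)!\,n!}{(n-1)!\,(i-n)!\,(n+i-1)!} = n \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{(i-1)\cdots(i-n+1)}{(i+n-1)\cdots i}.$$

Define

$$A_{m,n} := \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{(i-1)\cdots(i-m+1)}{(i+n-1)\cdots i}, \qquad m \ge 1,$$

so that in particular

$$A_{1,n} = \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{1}{(i+n-1)\cdots i}, \qquad A_n := A_{n,n}.$$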


Our goal then is to find an approximation of $A_n$. Observe that we have a
recursive structure:

$$\begin{aligned}
A_{m,n} &= \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{(i-1)\cdots(i-m+2)}{(i+n-2)\cdots i} - (m+n-2) \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{(i-1)\cdots(i-m+2)}{(i+n-1)\cdots i} \\
&= A_{m-1,n-1} - (m+n-2)\,A_{m-1,n}.
\end{aligned} \tag{4.33}$$

From this equation it also follows that

$$\begin{aligned}
A_{m,n} &= A_{1,n-m+1} + \sum_{k=0}^{m-2} \bigl(A_{m-k,n-k} - A_{m-k-1,n-k-1}\bigr) \\
&= A_{1,n-m+1} - \sum_{k=0}^{m-2} (m+n-2k-2)\, A_{m-k-1,n-k}.
\end{aligned} \tag{4.34}$$

First we show that for $1 \le m < n$,

$$A_{m,n} = \frac{(m-1)!\,(m-1)!\,(n-m-1)!}{(n-1)!\,(n-1)!} + O\!\left(\frac{1}{\alpha}\right). \tag{4.35}$$

We proceed by induction on $m$. For $m = 1$, we have, up to an error of $O(1/\alpha)$
(here $n > 1$ is important),

$$\begin{aligned}
A_{1,n} &= \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{1}{(i+n-1)\cdots i}
= \frac{1}{n-1} \sum_{i=1}^{\lfloor\alpha\rfloor} \left(\frac{1}{(i+n-2)\cdots i} - \frac{1}{(i+n-1)\cdots(i+1)}\right) \\
&= \frac{1}{(n-1)(n-1)!} + O\!\left(\frac{1}{\alpha}\right)
= \frac{(n-2)!}{(n-1)!\,(n-1)!} + O\!\left(\frac{1}{\alpha}\right)
\end{aligned}$$

and, by (4.33), again up to an error of $O(1/\alpha)$,

$$\begin{aligned}
A_{m+1,n} &= A_{m,n-1} - (m+n-1)\,A_{m,n} \\
&= \frac{(m-1)!\,(m-1)!\,(n-m-2)!}{(n-2)!\,(n-2)!} - (m+n-1)\,\frac{(m-1)!\,(m-1)!\,(n-m-1)!}{(n-1)!\,(n-1)!} \\
&= \frac{(m-1)!\,(m-1)!\,(n-m-2)!\,\bigl((n-1)^2 - (n-1+m)(n-1-m)\bigr)}{(n-1)!\,(n-1)!} \\
&= \frac{m!\,m!\,(n-(m+1)-1)!}{(n-1)!\,(n-1)!},
\end{aligned}$$

which proves (4.35).
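The recursion (4.33) holds term by term for every finite upper limit, while the closed form (4.35) holds only up to $O(1/\alpha)$. A numerical sketch, in which the function names are illustrative and the integer `alpha` plays the role of $\lfloor\alpha\rfloor$, lets one see both facts side by side:

```python
from math import factorial, prod


def A_mn(m, n, alpha):
    """Partial sum defining A_{m,n}: numerator (i-1)...(i-m+1), denominator (i+n-1)...i."""
    total = 0.0
    for i in range(1, alpha + 1):
        numerator = prod(range(i - m + 1, i))   # m - 1 factors; empty product is 1
        denominator = prod(range(i, i + n))     # n factors
        total += numerator / denominator
    return total


def A_mn_closed(m, n):
    """Leading term of A_{m,n} for 1 <= m < n."""
    return factorial(m - 1) ** 2 * factorial(n - m - 1) / factorial(n - 1) ** 2
```

For instance, `A_mn_closed(2, 3)` equals `0.25`, the value of $\sum_{i \ge 1} (i-1)/((i+2)(i+1)i)$.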

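From (4.35) and (4.34), we see, because

$$A_1 = \sum_{i=1}^{\lfloor\alpha\rfloor} \frac{1}{i} + O\!\left(\frac{1}{\alpha}\right), \tag{4.36}$$

that

$$\begin{aligned}
A_n &= A_1 - 2 \sum_{k=0}^{n-2} (n-k-1)\, \frac{(n-k-2)!\,(n-k-2)!}{(n-k-1)!\,(n-k-1)!} + O\!\left(\frac{1}{\alpha}\right) \\
&= A_1 - 2 \sum_{k=0}^{n-2} \frac{1}{n-k-1} + O\!\left(\frac{1}{\alpha}\right)
= A_1 - 2 \sum_{i=1}^{n-1} \frac{1}{i} + O\!\left(\frac{1}{\alpha}\right) \\
&= \frac{1}{n} - 1 + \sum_{i=n+1}^{\lfloor\alpha\rfloor} \frac{1}{i} - \sum_{i=2}^{n-1} \frac{1}{i} + O\!\left(\frac{1}{\alpha}\right).
\end{aligned} \tag{4.37}$$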

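Now (3.4) follows in the case $s = 1$ from (4.18), (4.32) and (4.37) as

$$\begin{aligned}
\sum_{i=1}^{\lfloor\alpha\rfloor} \frac{\binom{n+i-3}{n-1} - \binom{i-1}{n-1}}{\binom{n+i-1}{n}}
&= 1 - n + n \sum_{i=n+1}^{\lfloor\alpha\rfloor} \frac{1}{i} - n\left(\frac{1}{n} - 1 + \sum_{i=n+1}^{\lfloor\alpha\rfloor} \frac{1}{i} - \sum_{i=2}^{n-1} \frac{1}{i}\right) + O\!\left(\frac{1}{\alpha}\right) \\
&= n \sum_{i=2}^{n-1} \frac{1}{i} + O\!\left(\frac{1}{\alpha}\right). \qquad \Box
\end{aligned}$$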
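The cancellation of the two divergent harmonic sums in the case $s = 1$ can be observed directly. The sketch below (illustrative names, integer `alpha` in place of $\lfloor\alpha\rfloor$) evaluates the sum on the left and compares it with $n\sum_{i=2}^{n-1} 1/i$.

```python
from math import comb


def s1_sum(n, alpha):
    """Sum over i of (C(n+i-3, n-1) - C(i-1, n-1)) / C(n+i-1, n), for n >= 2."""
    return sum((comb(n + i - 3, n - 1) - comb(i - 1, n - 1)) / comb(n + i - 1, n)
               for i in range(1, alpha + 1))


def s1_limit(n):
    """Limit n * (1/2 + ... + 1/(n-1)); the empty sum gives 0 for n = 2."""
    return n * sum(1.0 / i for i in range(2, n))
```

For $n = 3$ the limit is $3/2$, which agrees with the value $6 \cdot \tfrac14$ obtained from $A_{2,3}$.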


Proof of Proposition 3.10. Let $\mathcal{L}$ denote the random subset that
consists of all those ancestral lineages of the sample which are hit by a late
mark. For a fixed subset $A$ of the $n$ ancestral lineages of the sample, we
conclude from Remark 3.7 and (3.6) that, for all $f \in \{1,\ldots,\lfloor\alpha\rfloor\}$,

$$\begin{aligned}
P[\mathcal{L} \cap A = \emptyset \mid F = f]
&= \prod_{i=f}^{\lfloor\alpha\rfloor} \frac{i\alpha}{i\alpha + a\gamma\alpha/\log\alpha} \\
&= \exp\!\left(-\frac{a\gamma}{\log\alpha} \sum_{i=f}^{\lfloor\alpha\rfloor} \frac{1}{i} + O\!\left(\frac{1}{(\log\alpha)^2}\right)\right) \\
&= p_f^{\,a} + O\!\left(\frac{1}{(\log\alpha)^2}\right),
\end{aligned} \tag{4.38}$$

where $a = \#A$ and the error term is uniform in $f$. Consequently, if we consider
the random subset $\mathcal{M}$ of the sample which results from the successes
of coin tossing with random success probability $p_F$, we observe that

$$P[\mathcal{L} \cap A = \emptyset] = E[(p_F)^a] + O\!\left(\frac{1}{(\log\alpha)^2}\right) = P[\mathcal{M} \cap A = \emptyset] + O\!\left(\frac{1}{(\log\alpha)^2}\right). \tag{4.39}$$

By inclusion-exclusion, (4.38) extends to the desired approximate equality
of the distributions of $\mathcal{L}$ and $\mathcal{M}$. □
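To see the size of the error in the product formula, one can compare the product with the exponential numerically. This is a sketch under an explicit assumption: $p_f$ is taken to be $\exp\bigl(-\frac{\gamma}{\log\alpha}\sum_{i=f}^{\lfloor\alpha\rfloor} 1/i\bigr)$, which is how it enters the display above; its definition in Section 3 is not reproduced here. The function names are illustrative.

```python
import math


def no_late_mark_product(a, f, alpha, gamma):
    """Product over i = f..floor(alpha) of i*alpha / (i*alpha + a*gamma*alpha/log(alpha))."""
    rate = a * gamma * alpha / math.log(alpha)
    return math.prod(i * alpha / (i * alpha + rate)
                     for i in range(f, math.floor(alpha) + 1))


def p_f(f, alpha, gamma):
    """Assumed form of p_f: exp(-(gamma/log alpha) * sum_{i=f}^{floor(alpha)} 1/i)."""
    harmonic = sum(1.0 / i for i in range(f, math.floor(alpha) + 1))
    return math.exp(-gamma / math.log(alpha) * harmonic)


def error_bound(a, alpha, gamma):
    """Elementary bound c**2 * pi**2 / 12 with c = a*gamma/log(alpha)."""
    c = a * gamma / math.log(alpha)
    return c * c * math.pi ** 2 / 12
```

The bound follows from $0 \le x - \log(1+x) \le x^2/2$ applied to each factor, and it is uniform in $f$ and of order $(\log\alpha)^{-2}$.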


Proof of Proposition 3.11. It remains to prove the approximate
independence of the random variables $S^Y$ and $L^Y$. It is enough to show
that, with the desired accuracy, $S^Y$ is independent of the event of a late
recombination of a randomly chosen line.

For convenience, we abuse notation, and write $S$ and $L$ instead of $S^Y$ and
$L^Y$. The approximate independence of the distributions of $S$ and $L$ relies
on two crucial observations. First, because $S > 0$ only has a probability of
order $O(1/\log\alpha)$ (see Proposition 3.9), we can allow for a multiplicative error
of order $O(1/\log\alpha)$ for the probability of $L = l$. The second observation is that
the two probabilities $P[L = l]$ and $P[L = l \mid K_i = k]$ are at most
$C_2(1+\log i)/\log\alpha$ apart, which is the content of (4.41).

Lemma 4.12. There are constants $C_1$ and $C_2$ that depend only on $n$
such that

$$\bigl|P[L = l \mid F = f] - P[L = l \mid F = f']\bigr| \le C_1\, \frac{(1 + \log f)(1 + \log f')}{\log\alpha}, \tag{4.40}$$

$$\bigl|P[L = l \mid K_i = k] - P[L = l]\bigr| \le \frac{C_2 (1 + \log i)}{\log\alpha}. \tag{4.41}$$


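Proof. We start by proving (4.40). Given $F = f$, the number of late
recombinants is approximately binomially distributed with parameters $n$
and $1 - p_f$ (see Proposition 3.10). Thus, for $f, f' \le \alpha$,

$$\begin{aligned}
\bigl|P[L = l \mid F = f] - P[L = l \mid F = f']\bigr|
&= \binom{n}{l} \bigl|(1-p_f)^l p_f^{\,n-l} - (1-p_{f'})^l p_{f'}^{\,n-l}\bigr| + O\!\left(\frac{1}{(\log\alpha)^2}\right) \\
&\le \binom{n}{l} \sum_{k=0}^{l} \binom{l}{k} \bigl|p_f^{\,n-l+k} - p_{f'}^{\,n-l+k}\bigr| + O\!\left(\frac{1}{(\log\alpha)^2}\right),
\end{aligned}$$

where we have used that $(1-p)^l = \sum_{k=0}^{l} \binom{l}{k} (-p)^k$. Now (4.40) follows, since for
$0 \le m \le 2n$ and (w.l.o.g.) $f \le f'$,

$$\bigl|p_f^{\,m} - p_{f'}^{\,m}\bigr| \le \frac{C(1 + \log f')}{\log\alpha}.$$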


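To show (4.41), we calculate directly. For some $C$ (which may change from
occurrence to occurrence) we obtain, noting that $P[F > \lfloor\alpha\rfloor] = O(1/\alpha)$ from
(4.17) and that, for $i \le f$, $K_i$ and $L$ are conditionally independent given $F = f$,

$$\begin{aligned}
\bigl|P[L = l \mid K_i = k] - P[L = l]\bigr|
&= \Bigl|\sum_{f=i}^{\lfloor\alpha\rfloor} \bigl(P[L = l \mid F = f] \cdot P[F = f \mid K_i = k] - P[L = l] \cdot P[F = f \mid K_i = k]\bigr)\Bigr| + O\!\left(\frac{1}{\alpha}\right) \\
&= \Bigl|\sum_{f=i}^{\lfloor\alpha\rfloor} \sum_{f'=n}^{\lfloor\alpha\rfloor} P[F = f \mid K_i = k] \cdot P[F = f'] \bigl(P[L = l \mid F = f] - P[L = l \mid F = f']\bigr)\Bigr| + O\!\left(\frac{1}{\alpha}\right) \\
&\le \frac{C}{\log\alpha} \sum_{f=i}^{\lfloor\alpha\rfloor} \sum_{f'=n}^{\lfloor\alpha\rfloor} \frac{i}{f^2} \cdot \frac{1}{f'^2}\, (1 + \log f)(1 + \log f') + O\!\left(\frac{1}{\alpha}\right) \\
&\le \frac{C}{\log\alpha} \sum_{f=i}^{\lfloor\alpha\rfloor} \frac{i(1 + \log f)}{f^2} + O\!\left(\frac{1}{\alpha}\right)
\le \frac{C(1 + \log i)}{\log\alpha},
\end{aligned}$$

where we have used again (4.17) and the integral (4.23). □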


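With the help of the previous lemma, we can now complete the proof of
Proposition 3.11. First take $s > 0$. Then the assertion follows from the above
statements, since by (3.5),

$$\begin{aligned}
P[L = l, S = s]
&= P[L = l, S = s, M = 1] + O\!\left(\frac{1}{(\log\alpha)^2}\right) \\
&= \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} P[L = l \mid S = s, K_i = k, M = M_i = 1] \cdot P[S = s, K_i = k, M = M_i = 1] + O\!\left(\frac{1}{(\log\alpha)^2}\right) \\
&= \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} P[L = l \mid K_i = k] \, P[S = s, K_i = k, M = M_i = 1] + O\!\left(\frac{1}{(\log\alpha)^2}\right) \\
&\overset{(4.41)}{=} \sum_{i=1}^{\lfloor\alpha\rfloor} \sum_{k=1}^{i \wedge (n-1)} \left(P[L = l] + O\!\left(\frac{1 + \log i}{\log\alpha}\right)\right) P[S = s, K_i = k, M = M_i = 1] + O\!\left(\frac{1}{(\log\alpha)^2}\right) \\
&\overset{(4.21)}{=} P[L = l] \cdot P[S = s] + O\!\left(\frac{1}{(\log\alpha)^2}\right).
\end{aligned}$$

Also for $s = 0$ the assertion is true, since by the above

$$\begin{aligned}
P[L = l, S = 0] &= P[L = l] - P[L = l, S > 0] \\
&= P[L = l]\,(1 - P[S > 0]) + O\!\left(\frac{1}{(\log\alpha)^2}\right) \\
&= P[L = l] \cdot P[S = 0] + O\!\left(\frac{1}{(\log\alpha)^2}\right). \qquad \Box
\end{aligned}$$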


Proof of Proposition 2.8. Schweinsberg and Durrett [21] also used
Yule processes to obtain an approximate scenario for the genealogy under
hitchhiking. In fact, the third step in the approximation in our paper (see
Sections 3.2 and 3.5) leads to the very same Yule process that appeared
in [21]. To be exact about this, note that by Remark 3.7, one line in our
Yule tree is hit by a mark during Yule time $i$ with probability

$$\frac{\rho}{i\alpha + \rho} = \frac{\gamma/\log\alpha}{i + \gamma/\log\alpha} = \frac{\gamma}{\log\alpha}\left(\frac{1}{i} + O\!\left(\frac{1}{i^2 \log\alpha}\right)\right).$$

In the model of Schweinsberg and Durrett [21] as given in their (7.2), a line
is hit during Yule time $i$, as $\rho = 2Nr$, $\alpha = 2Ns$, with probability

$$\frac{r}{is + r(1-s)} = \frac{\rho}{i\alpha + \rho(1-s)} = \frac{\gamma}{\log\alpha}\left(\frac{1}{i} + O\!\left(\frac{1}{i^2 \log\alpha}\right)\right),$$

which proves the proposition. □
4.5. The sampling formula. 


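Proof of Corollary 2.7. Using Theorem 1, we calculate, because
$L$ and $S$ are independent,

$$\begin{aligned}
P[E = e, L = l]
&= \sum_{s=e}^{n} P[E = e \mid L = l, S = s] \cdot P[L = l, S = s] \\
&= P[L = l] \sum_{s=e}^{n} \frac{\binom{s}{e}\binom{n-s}{n-l-e}}{\binom{n}{l}}\, P[S = s] \\
&= E\bigl[p_F^{\,n-l}(1-p_F)^l\bigr] \sum_{s=e}^{n} \binom{s}{e}\binom{n-s}{n-l-e}\, P[S = s].
\end{aligned}$$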

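Let us now distinguish the three cases $e \ge 2$, $e = 1$ and $e = 0$. By a calculation
using binomial coefficients (or using a computer algebra program or by [20],
page 7, (2)),

$$\sum_{s=e}^{n} \binom{s-2}{e-2}\binom{n-s}{n-l-e} = \binom{n-1}{l}\, \mathbf{1}\{l + e \le n\}.$$

With this we can calculate for $e \ge 2$, as long as $l + e \le n$,

$$\begin{aligned}
\sum_{s=e}^{n} \binom{s}{e}\binom{n-s}{n-l-e}\, P[S = s]
&= \frac{n\gamma}{\log\alpha}\left(\frac{1}{n}\binom{n}{e}\mathbf{1}\{l + e = n\} + \sum_{s=e}^{n} \binom{s}{e}\binom{n-s}{n-l-e} \frac{1}{s(s-1)}\right) \\
&= \frac{n\gamma}{\log\alpha}\left(\frac{1}{n}\binom{n}{e}\mathbf{1}\{l + e = n\} + \binom{n-1}{l}\frac{1}{e(e-1)}\right) \\
&= \frac{n\gamma}{\log\alpha} \cdot \frac{(n-1)\binom{n-2}{e-2}\mathbf{1}\{l + e = n\} + \binom{n-1}{l}}{e(e-1)},
\end{aligned}$$

where we have used $\frac{1}{n-1} = \frac{1}{n} + \frac{1}{n(n-1)}$ for the case $s = n$. This gives (2.7) in
the case $e \ge 2$. For $e = 1$, we have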


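$$\sum_{s=1}^{n} \binom{s}{1}\binom{n-s}{n-l-1}\, P[S = s] = \frac{n\gamma}{\log\alpha}\left(\binom{n-1}{l} \sum_{i=2}^{n-1} \frac{1}{i} + \mathbf{1}\{l = n-1\} + \sum_{s=2}^{n} \binom{n-s}{n-l-1} \frac{1}{s-1}\right),$$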

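which gives the result for $e = 1$. For the case $e = 0$, we first calculate

$$P[S = 0] = 1 - \sum_{s=1}^{n} P[S = s] = 1 - \frac{n\gamma}{\log\alpha}\left(\frac{1}{n} + \sum_{i=2}^{n-1} \frac{1}{i} + \sum_{s=2}^{n} \frac{1}{s(s-1)}\right) = 1 - \frac{n\gamma}{\log\alpha} \sum_{i=1}^{n-1} \frac{1}{i}.$$

With this we see

$$\sum_{s=0}^{n} \binom{n-s}{n-l}\, P[S = s] = \binom{n}{l}\left(1 - \frac{n\gamma}{\log\alpha} \sum_{i=1}^{n-1} \frac{1}{i}\right) + \frac{n\gamma}{\log\alpha}\left(\frac{1}{n}\mathbf{1}\{l = n\} + \binom{n-1}{n-l} \sum_{i=2}^{n-1} \frac{1}{i} + \sum_{s=2}^{n} \binom{n-s}{n-l} \frac{1}{s(s-1)}\right),$$

which completes the proof. □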
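The binomial identities used in the three cases can be verified in exact arithmetic, which is one way of carrying out the "computer algebra" check mentioned in the proof. The sketch below is illustrative: `family_weight(n, s)` is a hypothetical helper returning the leading-order weight of $P[S = s]$ with the common factor $n\gamma/\log\alpha$ removed, as read off from (3.4) in the preceding proofs.

```python
from fractions import Fraction
from math import comb


def family_weight(n, s):
    """Leading coefficient of P[S = s] divided by n*gamma/log(alpha), 1 <= s <= n."""
    if s == n:
        return Fraction(1, n - 1)
    if s == 1:
        return sum((Fraction(1, i) for i in range(2, n)), Fraction(0))
    return Fraction(1, s * (s - 1))


def vandermonde_lhs(n, l, e):
    """Sum over s = e..n of C(s-2, e-2) * C(n-s, n-l-e), for e >= 2 and l + e <= n."""
    return sum(comb(s - 2, e - 2) * comb(n - s, n - l - e) for s in range(e, n + 1))


def case_e_ge_2(n, l, e):
    """Sum over s = e..n of C(s, e) * C(n-s, n-l-e) * family_weight(n, s)."""
    return sum(comb(s, e) * comb(n - s, n - l - e) * family_weight(n, s)
               for s in range(e, n + 1))


def case_e_ge_2_closed(n, l, e):
    """((n-1) * C(n-2, e-2) * 1{l+e=n} + C(n-1, l)) / (e * (e-1))."""
    indicator = 1 if l + e == n else 0
    return Fraction((n - 1) * comb(n - 2, e - 2) * indicator + comb(n - 1, l),
                    e * (e - 1))
```

The weights also sum to $\sum_{i=1}^{n-1} 1/i$ over $s = 1, \ldots, n$, which is the quantity subtracted from 1 in the expression for $P[S = 0]$.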